The Tsetlin Machine (TM) is a novel machine learning paradigm introduced in 2018. It is based on the Tsetlin Automaton (TA), one of the pioneering solutions to the well-known multi-armed bandit problem [20, 8] and the first Finite State Learning Automaton (FSLA). The TM has the following main properties that make it attractive as a building block for machine learning: (a) It solves complex pattern recognition problems with interpretable propositional formulae, which is crucial for high-stakes decisions. (b) It learns on-line, persistently increasing both training and test accuracy, before converging to a Nash equilibrium. The Nash equilibrium balances false positive against false negative classifications, while combating overfitting with frequent itemset mining principles. (c) Resource allocation dynamics are leveraged to optimize usage of limited pattern representation resources. By allocating resources uniformly across sub-patterns, local optima are avoided. (d) Inputs, patterns, and outputs are expressed as bits, while recognition and learning rely on straightforward bit manipulation. This facilitates low-energy hardware design. (e) Finally, the TM has provided competitive accuracy in comparison with classical and neural network based techniques, while keeping the important property of interpretability.
The TM is currently the state of the art in FSLA-based pattern recognition. However, when compared with CNNs, it struggles to attain competitive accuracy, providing e.g. % mean accuracy on MNIST (without augmenting the data). To address this deficiency, we here introduce the Convolutional Tsetlin Machine (CTM), a new kind of TM designed for image classification.
FSLA. The simple Tsetlin Automaton approach has formed the core of more advanced FSLA designs that solve a wide range of problems. This includes decentralized control, equi-partitioning, streaming sampling for social activity networks, faulty dichotomous search, and learning in deceptive environments, to list a few examples. The unique strength of all of these FSLA designs is that they provide state-of-the-art performance when problem properties are unknown and stochastic, and the problem must be solved as quickly as possible through trial and error.
Rule-based Machine Learning. While the present paper focuses on extending the field of FSLA, we acknowledge the extensive work on rule-based interpretable pattern recognition from other fields of machine learning. Learning propositional formulae to represent patterns in data has a long history. One prominent example is frequent itemset mining for association rule learning, for instance applied to predicting sequential events [22, 16]. Other examples include the work of Feldman, who investigated the hardness of learning formulae in disjunctive normal form (DNF). Furthermore, Probably Approximately Correct (PAC) learning has provided fundamental insight into machine learning, as well as a framework for learning formulae in DNF. Approximate Bayesian approaches have recently been introduced to provide more robust learning of rules [26, 10]. However, in general, rule-based machine learning scales poorly and is prone to noise. Indeed, for data-rich problems, in particular those involving natural language and sensory inputs, state-of-the-art rule-based machine learning is inferior to deep learning. The CTM attempts to bridge the gap between the interpretability of rule-based machine learning and the accuracy of deep learning, by allowing the TM to deal more effectively with images.
A myriad of image recognition techniques have been reported. However, after AlexNet won the ImageNet recognition challenge by a significant margin in 2012, the entire field of computer vision has been dominated by CNNs. The AlexNet architecture was built upon earlier work by LeCun et al., who introduced CNNs in 1998. Countless CNN architectures, all following the same basic principles, have since been published, including the now state-of-the-art Squeeze-and-Excitation networks. In this paper, we introduce convolution to the TM paradigm of machine learning. By doing so, we simultaneously propose a new kind of convolution filter: an interpretable filter expressed as a propositional formula. Our intent is to address a well-known disadvantage of CNNs, namely, that CNN models in general are complex and non-transparent, making them hard to interpret. Consequently, the knowledge of why CNNs perform so well and what steps are needed to improve the models is limited.
Because it relies on a large number of multiply-accumulate operations, training CNNs is computationally intensive. To mitigate this, there is an increasing interest in binarizing CNNs. With only two possible values for the synapse weights, e.g., -1 and +1, many of the multiplication operations can potentially be replaced with simple accumulations. This could open the way for more specialized hardware and more compressed and efficient models. One recent approach is BinaryConnect, which reached a near state-of-the-art accuracy of % on MNIST, and % in combination with an SVM. Binary CNNs have been further improved with the introduction of XNOR-Nets, which replace the standard CNN filters with binary equivalents. BNN+ is the most recent binary CNN, extending the XNOR-Nets with two additional regularization functions and an adaptive scaling factor. The CTM can be seen as an extreme binary CNN in the sense that it is entirely based on fast summation and logical operators. Additionally, learning in the CTM is bandit-based (online learning from reinforcement), while binary CNNs are based on backpropagation.
Contributions and Paper Outline. Our contributions can be summarized as follows. First, in Section 2, we provide a brief introduction to the TM and a succinct definition of both recognition and learning. Then, in Section 3, we introduce the concept of convolution to TMs. Whereas the classic TM categorizes an image by employing each clause once on the whole image, the CTM uses each clause as a convolution filter. In all brevity, we propose a recognition and learning approach for clause-based convolution that produces interpretable filters. In Section 4, we evaluate the CTM on MNIST, Kuzushiji-MNIST, Fashion-MNIST, and the 2D Noisy XOR Problem, and discuss the merits of the new scheme. Finally, we conclude and provide pointers to further work in Section 5.
2 The Tsetlin Machine
The TM solves complex pattern recognition problems with conjunctive clauses in propositional logic, composed by a collective of TAs. It is rooted in game theory, the bandit problem, FSLA games, resource allocation, and frequent itemset mining. The rationale behind the TM can be found in full depth in the original paper on the TM (the paper also includes pseudo code). Despite the rather complex interplay of several research areas, both recognition and learning in the TM can be defined succinctly as follows.
Structure. The structure of the TM is shown in Figure 0(a). The TM takes a vector of propositional variables as input: , where is the size of the vector. The input is fed to multiple conjunctive clauses, each of which has the form: . Here, and are non-overlapping subsets of the input variable indexes, . The subsets decide which of the input variables are Included in the clause (hence the upper index ), and whether they are negated or not. Conversely, we will in the following denote the variables that are Excluded with . To refer to a specific clause , we will add a lower index . The input variables from are included as is, while the input variables from are negated. Jointly, the non-negated and negated variables are referred to as literals. The output of a clause is accordingly fully specified by the input and the selection of variables in and . In the TM, each clause is further assigned a fixed polarity , which decides whether the clause output is negative or positive. Positive clauses are used to recognize class , while negative clauses are used to recognize class . Finally, a summation operator aggregates the output of the clauses, while a threshold function decides the class predicted for the input . (Multi-class pattern recognition problems are modelled by employing multiple instances of this structure, replacing the threshold operator with an argmax operator.)
Recognition. To effectively deal with negated variables, the first step of inference consists of constructing an augmented input vector by concatenating with negated:
The next step is to introduce the collective of TAs, which decides the composition of each clause. Assuming a structure with clauses, half of the clauses are given negative polarity, while the other half are given positive polarity. We here only define the operations on clauses with positive polarity. The definition for the negative ones is analogous.
First of all, there are TAs per clause , one for each input . Each individual TA has a state , and the states of all of the TAs are organized in an matrix ( is divided by two since we only consider positive clauses): . The states decide the actions taken by the TAs. That is, for each clause , some variables are Excluded: , and some are Included: .
To calculate the output of the clauses, we also need the indexes of the input variables of value : . Here, the lower index refers to the set of input variables and the upper index refers to the variable value used for selection. The output of each positive clause, organized as a vector , can then be calculated as follows:
Next, we sum the output of the positive and negative clauses: (ignoring empty clauses post-learning). Finally, a threshold decides the prediction of the TM: .
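The recognition steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of a clause as a list of included literal indexes, the vote of -1 for negative clauses, and the threshold at zero are assumptions made for illustration. Empty clauses are ignored, as in the post-learning case described above.

```python
from typing import List, Sequence


def augment(x: Sequence[int]) -> List[int]:
    """Concatenate the input bits with their negations: [x, NOT x]."""
    return list(x) + [1 - bit for bit in x]


def clause_output(included: Sequence[int], literals: Sequence[int]) -> int:
    """AND over the included literals; an empty clause is ignored (outputs 0)."""
    if not included:
        return 0
    return int(all(literals[k] == 1 for k in included))


def tm_predict(x, positive_clauses, negative_clauses) -> int:
    """Sum positive clause outputs, subtract negative ones, threshold at 0 (assumed)."""
    literals = augment(x)
    votes = sum(clause_output(c, literals) for c in positive_clauses)
    votes -= sum(clause_output(c, literals) for c in negative_clauses)
    return 1 if votes >= 0 else 0


# Hypothetical hand-written clauses for XOR over x = [x1, x2].
# Literal indexes: 0 -> x1, 1 -> x2, 2 -> NOT x1, 3 -> NOT x2.
positive = [[0, 3], [2, 1]]  # x1 AND NOT x2, NOT x1 AND x2
negative = [[0, 1], [2, 3]]  # x1 AND x2, NOT x1 AND NOT x2

predictions = [tm_predict(x, positive, negative)
               for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

Representing each clause by the indexes of its included literals keeps the clause readable as a propositional formula, which is the interpretability property emphasized in the text.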
Learning. Learning in the TM is based on coordinating the collective of TAs using a novel FSLA game that leverages resource-allocation and frequent itemset mining principles. The TM employs two kinds of feedback to provide reinforcement to the TAs: Type I and Type II. Type I feedback combats false negatives, while Type II feedback suppresses false positives. In brief, feedback is handed out based on training examples , consisting of an input and an output . As illustrated by the feedback loop in Figure 0(a) (Bandit Learning), feedback is controlled by the sum and a target value set for by the user. A larger (with a corresponding increase in the number of clauses) makes the learning more robust. This is because more clauses are involved in learning each specific pattern, introducing an ensemble effect.
For training output , TAs belonging to positive clauses are given Type I feedback to make approach . A matrix picks out the positive clauses selected for feedback (the lower index of refers to the output ):
Here, the lower index of refers to the clause and the output . As seen, clauses are randomly selected for feedback, with a larger chance of being selected for lower . We now divide the Type I feedback into two parts, Type Ia and Type Ib. Type Ia reinforces Include actions to capture patterns within . Only TAs taking part in clauses that output 1 and whose associated variable takes the value 1 are selected for Type Ia feedback: . Type Ib feedback is designed to reinforce Exclude actions to combat over-fitting. Type Ib feedback is handed out to the TAs stochastically, using a user-set parameter (a larger provides finer patterns). The stochastic part of the calculation is organized in a matrix with entries:
Here, the lower index of refers to the clause and the TA , considering the positive clauses. TAs are selected for Type Ib feedback as follows: . That is, TAs taking part in clauses that output , or whose associated variable takes the value 0, are selected for Type Ib feedback, but only if stochastically selected by and . Finally, the states of the TAs are updated by increasing/decreasing the states of the selected TAs with two state update operators and : These operators add/subtract from the states of the singled out TAs, but never beyond the given state space.
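As a rough illustration of Type Ia and Type Ib feedback, the following Python sketch updates the TAs of a single positive clause that has already been selected for feedback. Several details are assumptions that are not stated in the surviving text: states range over 1..2N (here `n_states` plays the role of N), a higher state means a stronger Include action, each state changes by one step, and Type Ib is applied with probability 1/s for the user-set parameter `s`.

```python
import random
from typing import List, Sequence


def type_i_feedback(states: Sequence[int], literals: Sequence[int],
                    clause_out: int, s: float, n_states: int,
                    rng: random.Random) -> List[int]:
    """Return updated TA states for one clause selected for Type I feedback."""
    updated = []
    for state, literal in zip(states, literals):
        if clause_out == 1 and literal == 1:
            # Type Ia: reinforce Include, bounded by the state space (assumed 1..2N).
            state = min(2 * n_states, state + 1)
        elif rng.random() < 1.0 / s:
            # Type Ib: clause output 0 or literal 0; reinforce Exclude stochastically.
            state = max(1, state - 1)
        updated.append(state)
    return updated


rng = random.Random(0)
# With s = 1 every Type Ib candidate is selected, which makes the update deterministic.
after_match = type_i_feedback([3, 3, 6, 1], [1, 0, 1, 0], clause_out=1,
                              s=1.0, n_states=3, rng=rng)
after_miss = type_i_feedback([3, 3, 6, 1], [1, 0, 1, 0], clause_out=0,
                             s=1.0, n_states=3, rng=rng)
```

Larger values of `s` make Type Ib updates rarer, so Include actions are forgotten more slowly, which is consistent with the remark above that a larger parameter value yields finer patterns.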
For training output , TAs belonging to positive clauses are given Type II feedback to suppress clause output (combats false positives). This, together with the negative clauses, makes approach . A matrix picks out the positive clauses selected for feedback (the lower index of refers to the output ):
The lower index of refers to the clause and the output , respectively. Here too, clauses are randomly selected for feedback, with a larger chance of being selected for higher . Next, the TAs selected are the ones that will turn clauses that output into clauses that output : . Again, the states of the selected TAs are updated using the dedicated operators: . All of the above operations are for positive clauses. For negative clauses, Type I feedback is simply replaced with Type II feedback and vice versa.
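The random selection of clauses and the Type II update can be sketched in Python as follows. The selection probabilities follow the rule given in the original TM paper (clip the clause sum to [-T, T], then use (T - v)/2T for Type I and (T + v)/2T for Type II); since the formulas are not legible in the text above, treat them, along with the function names and the 1..2N state convention, as assumptions.

```python
import random
from typing import List, Sequence


def feedback_probability(v: int, t: int, y: int) -> float:
    """Chance that a positive clause is selected for feedback, given sum v and target t."""
    v = max(-t, min(t, v))  # clip the clause sum to [-t, t]
    return (t - v) / (2 * t) if y == 1 else (t + v) / (2 * t)


def select_clauses(n_clauses: int, v: int, t: int, y: int,
                   rng: random.Random) -> List[bool]:
    """Independently select each positive clause for feedback."""
    p = feedback_probability(v, t, y)
    return [rng.random() < p for _ in range(n_clauses)]


def type_ii_feedback(states: Sequence[int], literals: Sequence[int],
                     clause_out: int, n_states: int) -> List[int]:
    """Push excluded TAs whose literal is 0 towards Include when the clause outputs 1."""
    updated = []
    for state, literal in zip(states, literals):
        excluded = state <= n_states  # assumed: states 1..N mean Exclude
        if clause_out == 1 and literal == 0 and excluded:
            state = min(2 * n_states, state + 1)
        updated.append(state)
    return updated


rng = random.Random(0)
none_selected = select_clauses(4, v=2, t=2, y=1, rng=rng)  # sum already at target
all_selected = select_clauses(4, v=-2, t=2, y=1, rng=rng)  # sum far from target
blocked = type_ii_feedback([2, 5, 3], [0, 1, 0], clause_out=1, n_states=3)
```

Once an excluded literal of value 0 crosses into the Include half of the state space, the clause can no longer output 1 on that input, which is how Type II feedback suppresses false positives.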
3 The Convolutional Tsetlin Machine
Consider a set of images , where is the index of the images. Each image is of size , and consists of binary layers (which together encode the pixel colors using thresholding, one-hot encoding, or any other appropriate binary encoding). A classic TM models such an image with an input vector that contains propositional variables. Further, each clause is composed from literals. As illustrated in Figure 0(b), a clause then quite simply decides which bits of a given image layer must take the value , which must take the value , and which can be ignored, to match the clause. Inspired by the impact convolution has had on deep neural networks, we here introduce the Convolutional Tsetlin Machine (CTM).
Interpretable Rule-based Filters. The CTM uses filters with spatial dimensions , again with binary layers. Further, the clauses of the CTM take the role of filters. Each clause is accordingly composed from literals. Additionally, to make the clauses location-aware, we augment each clause with binary encoded coordinates. Location awareness may prove useful in applications where both patterns and their location are distinguishing features, e.g., recognition of facial features such as eyes, eyebrows, nose, and mouth in facial expression recognition. In brief, when applying a filter of size on an image , the filter will be evaluated on image patches. Here, and , with being the step size of the convolution. Each image patch thus has a certain location within the image, and we augment the input vector with the coordinates of this location. We denote the resulting augmented input vector : . As seen, the input vector is extended with one propositional variable per position along each dimension, with the position being encoded using thresholding or one-hot encoding. Figure 2(b) illustrates an example of the image, patches, and a filter for , , and . In this example, the 3×3 filter moves from left to right, from top to bottom, pixels per step.
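To make the patch construction concrete, the sketch below enumerates the patches of a single-layer binary image and appends one-hot encoded patch coordinates. The patch count uses the standard convolution formula, floor((size - filter) / step) + 1 per dimension. The names `augmented_patches` and `one_hot`, the restriction to one binary layer, and the order in which the coordinate bits are appended are illustrative assumptions.

```python
from typing import List


def one_hot(position: int, n_positions: int) -> List[int]:
    """One propositional variable per position along a dimension."""
    return [1 if i == position else 0 for i in range(n_positions)]


def augmented_patches(image: List[List[int]], w: int, d: int) -> List[List[int]]:
    """Slide a w-by-w window with step d; return patch bits plus coordinate bits."""
    y_size, x_size = len(image), len(image[0])
    b_x = (x_size - w) // d + 1  # number of patch positions along X
    b_y = (y_size - w) // d + 1  # number of patch positions along Y
    patches = []
    for p_y in range(b_y):  # top to bottom
        for p_x in range(b_x):  # left to right
            top, left = p_y * d, p_x * d
            pixels = [image[top + r][left + c]
                      for r in range(w) for c in range(w)]
            patches.append(pixels + one_hot(p_y, b_y) + one_hot(p_x, b_x))
    return patches


image = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
patches = augmented_patches(image, w=2, d=1)
```

For this hypothetical 4×4 image with a 2×2 filter and step 1, there are 3 × 3 = 9 patches, each described by 4 pixel bits and 3 + 3 coordinate bits.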
Recognition. The CTM uses the classic TM procedure for recognition (see Section 2). However, for the CTM each clause outputs values per image (one value per patch), as opposed to a single output for the TM (Eq. 2). We denote the output of a positive clause on patch by . To turn the multiple outputs of clause into a single output denoted by , we simply OR the individual outputs:
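In code, the OR aggregation over patches might look like the following minimal sketch. The clause representation (a list of included literal indexes over the patch bits followed by their negations) and the function names are assumptions for illustration.

```python
from typing import List, Sequence


def with_negations(bits: Sequence[int]) -> List[int]:
    """Literals of a patch: the bits followed by their negations."""
    return list(bits) + [1 - b for b in bits]


def clause_on_patch(included: Sequence[int], patch: Sequence[int]) -> int:
    """Evaluate the conjunctive clause on one patch."""
    literals = with_negations(patch)
    return int(all(literals[k] == 1 for k in included))


def ctm_clause_output(included: Sequence[int],
                      patches: Sequence[Sequence[int]]) -> int:
    """OR the per-patch outputs into a single clause output for the image."""
    return int(any(clause_on_patch(included, p) == 1 for p in patches))


# Hypothetical 2x2 "diagonal" filter over patch bits [x0, x1, x2, x3]:
# x0 AND NOT x1 AND NOT x2 AND x3 (negated literals start at index 4).
diagonal = [0, 5, 6, 3]
patches_with_match = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
patches_without_match = [[1, 1, 0, 0], [0, 0, 0, 0]]

hit = ctm_clause_output(diagonal, patches_with_match)
miss = ctm_clause_output(diagonal, patches_without_match)
```

Because of the OR, a clause fires if its pattern occurs in at least one patch, so the number of matching patches does not change the clause output.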
Learning. Learning in the CTM leverages the TM learning procedure. As seen in Section 2, Type Ia, Type Ib, and Type II feedback influence each clause based on the content of the augmented input vector . For the CTM, the input vector is an image patch, and there are patches in an image. There are thus augmented inputs , , per clause. Therefore, to decide which patch to use when updating a clause, the CTM randomly selects a single patch among the patches that made the clause evaluate to . The clause is then updated according to this patch. That is, an augmented input is drawn from the set: . Observe that if the set is empty, only Type Ib feedback is applicable, and then the augmented input vector is not needed. For non-empty sets, the TAs to be updated are finally singled out using the randomly selected patch:
The reason for randomly selecting a patch is to have each clause extract a certain sub-pattern, and the randomness of the uniform distribution statistically spreads the clauses across different sub-patterns in the target image. Finally, observe that the computational complexity of the CTM grows linearly with the number of clauses, and with the number of patches . However, the computations can be easily parallelized due to their decentralized nature.
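The patch selection step can be sketched as below: among the patches on which a clause outputs 1, one is drawn uniformly at random, and `None` signals the empty-set case in which only Type Ib feedback applies. The function name and the use of `None` are assumptions made for this illustration.

```python
import random
from typing import Optional, Sequence


def select_patch(patch_outputs: Sequence[int],
                 rng: random.Random) -> Optional[int]:
    """Return the index of a uniformly drawn patch on which the clause output 1."""
    matching = [b for b, out in enumerate(patch_outputs) if out == 1]
    if not matching:
        return None  # no matching patch: only Type Ib feedback is applicable
    return rng.choice(matching)


rng = random.Random(42)
no_match = select_patch([0, 0, 0], rng)
single_match = select_patch([0, 1, 0], rng)
draws = {select_patch([1, 0, 1, 1], rng) for _ in range(200)}
```

Over many updates, each matching patch is used about equally often, which is the spreading effect described above.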
Step-by-step Walk-through of Inference on Noisy 2D XOR. Rather than relying on hand-crafted features for image classification, the CTM learns feature detectors. We will explain the workings of the CTM by an illustrative example of noisy 2D XOR recognition and learning (see Figure 3 and Section 4). Consider the CTM depicted in Figure 1(a). It consists of four positive clauses which represent XOR patterns that must be present in a positive example image (positive features) and four negative clauses which represent patterns that will not trigger a positive image classification (negative features). The number of positive and negative clauses is a user-defined parameter. The bit patterns inside each clause are represented by the output of four TAs, one for each bit in a filter.
Consider the image shown in Figure 1(b). The filter represented by the second positive clause matches the patch in the top-right corner of the image and it is the only clause with output ; similarly, none of the negative clauses respond since their patterns do not match the pattern found in the current patch (Figure 1(b)). Thus, the Tsetlin Machine’s combined output is . Learning of feature detectors proceeds as follows: With the CTM’s threshold value set to , the probability of feedback is , and learning thus takes place, pushing the CTM’s output towards . Note that Type I feedback reinforces true positive output and reduces false negative output whereas Type II feedback reduces false positive output.
A subsequent state of the CTM is shown in Figure 2(a). Note that there are now two positive clauses which detect their pattern in the top-right corner patch. The combined output of all clauses is ; thus, no further learning is necessary for the detection of the XOR pattern in this patch. Also, the location of the occurrence of each pattern is included. The location information uses a bit representation as follows: Suppose an XOR pattern occurs at the three X-coordinates , , and . For the corresponding binary location representation, these coordinates are considered thresholds: If a coordinate is greater than a threshold, then the corresponding bit in the binary representation will be ; otherwise, it is set to . Thus, the representation of the X-coordinates , , and will be ‘111’, ‘011’ and ‘001’, respectively. These representations of the location of patterns are also learned by TAs.
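The threshold-based location encoding can be reproduced with a few lines of Python. The coordinate values 1, 2, and 3, and the convention that a bit is 0 when the coordinate exceeds the threshold and 1 otherwise, are assumptions chosen because they yield the bit strings ‘111’, ‘011’ and ‘001’ quoted above.

```python
from typing import Sequence


def encode_coordinate(coordinate: int, thresholds: Sequence[int]) -> str:
    """One bit per threshold: '0' if the coordinate exceeds it, else '1' (assumed)."""
    return "".join("0" if coordinate > t else "1" for t in thresholds)


thresholds = (1, 2, 3)  # the coordinates themselves act as thresholds
codes = [encode_coordinate(x, thresholds) for x in (1, 2, 3)]
```

Unlike a one-hot code, this encoding lets a single included literal express a range of positions, for example "at or left of the second coordinate".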
4 Empirical Results
In this section, we evaluate the CTM on four different datasets:
2D Noisy XOR. The 2D Noisy XOR dataset contains binary images, training examples and test examples. The image bits have been set randomly, except for the patch in the upper right corner, which reveals the class of the image. A diagonal line is associated with class , while a horizontal or vertical line is associated with class . Thus the dataset models a 2D version of the XOR-relation. Furthermore, the dataset contains a large number of random non-informative features to measure susceptibility towards the curse of dimensionality. To examine robustness towards noise we have further randomly inverted of the outputs in the training data.
MNIST. The MNIST dataset, consisting of grey scale images of handwritten digits, has been used extensively to benchmark machine learning algorithms.
Kuzushiji-MNIST. This dataset contains grayscale images of Kuzushiji characters, i.e., cursive Japanese. Kuzushiji-MNIST is more challenging than MNIST because there are multiple distinct ways to write some of the characters.
Binary Fashion-MNIST. This dataset contains grayscale images of articles from the Zalando catalogue, such as t-shirts, sandals, and pullovers. This dataset is quite challenging, with a human accuracy of %. We binarize these data by thresholding on . We selected a rather low grey value, to capture the complete shape of the articles.
The latter three datasets contain training examples and test examples. We augmented MNIST with training images using InfiMNIST and KMNIST with training images using a random scaling factor in the range . Further, we encoded the pixel values using four bits () based on uniformly distributed thresholds. Table 2 reports test accuracy for the CTM, while Table 1 contains the corresponding configurations.
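As an illustration of encoding a pixel value with four bits from uniformly distributed thresholds, consider the sketch below. The exact threshold positions (evenly spaced over an assumed 0..255 grey scale range) and the "greater than" convention are assumptions; the text does not state them.

```python
from typing import List, Sequence


def uniform_thresholds(n_bits: int, n_levels: int = 256) -> List[float]:
    """Evenly spaced thresholds strictly inside the grey value range (assumed)."""
    step = n_levels / (n_bits + 1)
    return [step * (i + 1) for i in range(n_bits)]


def encode_pixel(value: int, thresholds: Sequence[float]) -> List[int]:
    """One bit per threshold: 1 if the grey value exceeds the threshold."""
    return [1 if value > t else 0 for t in thresholds]


thresholds = uniform_thresholds(4)  # roughly 51.2, 102.4, 153.6, 204.8
dark = encode_pixel(0, thresholds)
mid_grey = encode_pixel(128, thresholds)
bright = encode_pixel(255, thresholds)
```

Each bit then forms one binary layer of the image, so a clause can refer to intensity ranges with a single literal per pixel and layer.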
|Search Range||2D Noisy XOR||MNIST||K-MNIST||Fashion-MNIST|
|4-Nearest Neighbour [2, 27]|
|SVM [2, 27]|
|Simple CNN |
|FPGA-accelerated BNN ||-||-||-|
|CTM (95 %ile)|
|ResNet18 + VGG Ensemble ||-||-|
Single-run accuracy per epoch for CTM on (a) MNIST and (b) K-MNIST.
The hyperparameters were found using a grid search for the given ranges. The results are based on ensemble averages, obtained from the last epochs of , with replications of each experiment. The CTM performs rather robustly from run to run, providing tight % confidence intervals for the mean performance. While this evaluation focuses on the CTM, we have also included results for selected popular algorithms, as points of reference. Results listed in italic are reported in the corresponding papers. Results for BinaryConnect and FPGA-accelerated BNNs on K-MNIST and Fashion-MNIST were not available and are therefore not reported. Notice that the CTM outperforms the binary CNNs on MNIST, as well as a simple 4-layer CNN, an SVM and a 4-nearest neighbour configuration. However, it is outperformed by the more advanced deep learning network architectures PreActResNet-18 and ResNet18+VGG. Figure 3(a) depicts training and test accuracy for the CTM on MNIST, epoch-by-epoch, in a single run. Test accuracy peaks at % after epochs and % after epochs. Figure 3(b) contains corresponding results for Kuzushiji-MNIST. Here, test accuracy peaks at % after epochs and % after epochs. Further, test accuracy climbs quickly in the first epochs, passing % already in epoch for MNIST. For both datasets, training accuracy approaches %.
5 Conclusion and Further Work
This paper introduced the Convolutional Tsetlin Machine (CTM), leveraging the learning mechanism of the Tsetlin Machine (TM). Whereas the TM categorizes images by employing each clause once per input, the CTM uses each clause as a convolution filter. The filters learned by the CTM are interpretable, being formulated using propositional formulae (see Figure 0(b)). To make the clauses location-aware, each patch is further enhanced with its coordinates within the image. Location awareness may prove useful in applications where both patterns and their location are distinguishing features, e.g., recognition of facial features such as eyes, eyebrows, nose, and mouth in facial expression recognition. By randomly selecting which patch to learn from, the standard Type I and Type II feedback of the classic TM can be employed directly. In this manner, the CTM obtains results on MNIST, Kuzushiji-MNIST, Fashion-MNIST, and the 2D Noisy XOR Problem that compare favorably with simple 4-layer CNNs as well as two binary neural network architectures.
In our further work, we intend to investigate more advanced binary encoding schemes, to go beyond grey-scale images (e.g., addressing CIFAR-10 and ImageNet). We further intend to develop schemes for deeper CTMs, with the first step being a two-layer CTM, to introduce more compact and expressive patterns with nested propositional formulae.
References
- [1] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22(2):207–216, 1993.
- [2] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep Learning for Classical Japanese Literature. arXiv:1812.01718, Dec 2018.
- [3] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
- [4] S. Darabi, M. Belbahri, M. Courbariaux, and V. Partovi Nia. BNN+: Improved Binary Network Training. arXiv:1812.11800, Dec 2018.
- [5] K. Darshana Abeyrathna, O.-C. Granmo, X. Zhang, and M. Goodwin. A Scheme for Continuous Input to the Tsetlin Machine with Applications to Forecasting Disease Outbreaks. arXiv:1905.04199, May 2019.
- [6] V. Feldman. Hardness of Approximate Two-Level Logic Minimization and PAC Learning with Membership Queries. Journal of Computer and System Sciences, 75(1):13–26, 2009.
- [7] M. Ghavipour and M. R. Meybodi. A streaming sampling algorithm for social activity networks using fixed structure learning automata. Applied Intelligence, 2018.
- [8] J. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2):148–177, 1979.
- [9] O.-C. Granmo. The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv:1804.01508, Apr 2018.
- [10] J. R. Hauser, O. Toubia, T. Evgeniou, R. Befurt, and D. Dzyabura. Disjunctions of Conjunctions, Cognitive Simplicity, and Consideration Sets. Journal of Marketing Research, 47(3):485–496, 2010.
- [11] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
- [12] C. Lammie, W. Xiang, and M. Rahimi Azghadi. Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL. arXiv:1905.06105, May 2019.
- [13] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [14] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica. Tune: A Research Platform for Distributed Model Selection and Training. arXiv:1807.05118, Jul 2018.
- [15] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.
- [16] T. McCormick, C. Rudin, and D. Madigan. A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction. Annals of Applied Statistics, 2011.
- [17] K. S. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice-Hall, Inc., 1989.
- [18] B. J. Oommen and D. C. Ma. Deterministic Learning Automata Solutions to The Equipartitioning Problem. IEEE Transactions on Computers, 37(1):2–13, 1988.
- [19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
- [20] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
- [21] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
- [22] C. Rudin, B. Letham, and D. Madigan. Learning theory analysis for association rules and sequential event prediction. Journal of Machine Learning Research, 14:3441–3492, 2013.
- [23] M. L. Tsetlin. On behaviour of finite automata in random medium. Avtomat. i Telemekh, 22(10):1345–1354, 1961.
- [24] B. Tung and L. Kleinrock. Using Finite State Automata to Produce Self-Optimization and Self-Control. IEEE Transactions on Parallel and Distributed Systems, 7(4):47–61, 1996.
- [25] L. G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- [26] T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian Framework for Learning Rule Sets for Interpretable Classification. The Journal of Machine Learning Research, 18(1):2357–2393, 2017.
- [27] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
- [28] A. Yazidi and B. John Oommen. On the analysis of a random walk-jump chain with tree-based transitions and its applications to faulty dichotomous search. Sequential Analysis, 37:31–46, Jan 2018.
- [29] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
- [30] J. Zhang, Y. Wang, C. Wang, and M. Zhou. Symmetrical Hierarchical Stochastic Searching on the Line in Informative and Deceptive Environments. IEEE Transactions on Cybernetics, 47(3):626–635, Jul 2016.