A Tsetlin Machine with Multigranular Clauses

09/16/2019 · by Saeed Rahimi Gorji, et al. · Universitetet i Agder

The recently introduced Tsetlin Machine (TM) has provided competitive pattern recognition accuracy in several benchmarks; however, it requires a three-dimensional hyperparameter search. In this paper, we introduce the Multigranular Tsetlin Machine (MTM). The MTM eliminates the specificity hyperparameter, used by the TM to control the granularity of the conjunctive clauses that it produces for recognizing patterns. Instead of using a fixed global specificity, we encode varying specificity as part of the clauses, rendering the clauses multigranular. This makes it easier to configure the TM because the dimensionality of the hyperparameter search space is reduced to only two dimensions. Indeed, it turns out that significantly less hyperparameter tuning is involved in applying the MTM to new problems. Further, by comparing their performance on both synthetic and real-world datasets, we demonstrate empirically that the MTM provides accuracy similar to that of a TM with finely optimized specificity.


1 Introduction

The Tsetlin Machine (TM) is a new machine learning algorithm that was introduced in 2018 [9]. It leverages the ability of so-called learning automata (LA) to learn the optimal action in unknown stochastic environments [10]. The TM has provided competitive pattern recognition accuracy in several benchmarks, without losing the important property of interpretability [9].

The TM builds upon a long tradition of LA research, involving cooperating systems of LA [14, 13, 12, 11]. More recently, LA have been combined with cellular automata (CA), where each CA cell contains one or more LA, which learn in a distributed fashion [6, 3, 16]. Some noteworthy LA-based classifiers are further introduced in [4, 7, 1, 18, 2, 17]. However, these approaches mainly tackle small-scale pattern classification problems.

In all brevity, a TM consists of teams of Tsetlin Automata (TA) [15] that interact to solve complex pattern recognition problems. It takes a binary feature vector $X = (x_1, \ldots, x_o) \in \{0,1\}^o$ as input, which is further processed by $m$ conjunctive clauses $C_1^+, \ldots, C_{m/2}^+$ and $C_1^-, \ldots, C_{m/2}^-$. Each clause captures a specific sub-pattern, formulated as a conjunction of literals (binary features and their negations), e.g., $C_j(X) = x_a \land \cdots \land x_b \land \lnot x_c \land \cdots \land \lnot x_d$. Half of the clauses are assigned positive polarity; these describe sub-patterns for output $y = 1$. The other half is assigned negative polarity, describing sub-patterns for output $y = 0$. The output is thus simply decided by a majority vote: $\hat{y} = 1$ if $\sum_j C_j^+(X) - \sum_j C_j^-(X) \ge 0$, and $\hat{y} = 0$ otherwise.
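
To make the clause-and-vote structure concrete, the following minimal Python sketch evaluates a set of conjunctive clauses on a binary input and combines them by majority vote. The clause representation (lists of included literal indices) and the function names are illustrative, not taken from the implementation in [9].

```python
# Minimal sketch of TM inference (illustrative, not the authors' implementation).
# A clause is represented by the indices of the literals it includes:
# 'pos' lists features that must be 1, 'neg' lists features that must be 0.

def evaluate_clause(clause, x):
    """Return 1 if the conjunction of the clause's literals holds for input x."""
    pos_ok = all(x[k] == 1 for k in clause["pos"])
    neg_ok = all(x[k] == 0 for k in clause["neg"])
    return 1 if (pos_ok and neg_ok) else 0

def classify(positive_clauses, negative_clauses, x):
    """Majority vote: positive-polarity clauses vote for y=1, negative for y=0."""
    votes = sum(evaluate_clause(c, x) for c in positive_clauses) \
          - sum(evaluate_clause(c, x) for c in negative_clauses)
    return 1 if votes >= 0 else 0

# Example: two toy clauses over a 4-bit input.
pos = [{"pos": [0, 1], "neg": []}]   # requires the first two features to be 1
neg = [{"pos": [2], "neg": [0]}]     # requires the third feature to be 1 and the first to be 0
print(classify(pos, neg, [1, 1, 0, 0]))   # -> 1
```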

During learning, each team of TA is responsible for a specific clause. There are two TA per feature $x_k$: one decides whether to include $x_k$ in the clause, while the other decides upon including $\lnot x_k$. These decisions are updated based on reinforcement derived from training examples $(X, y)$, contrasting the current clauses against $y$ (see [9] for further details).
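
The include/exclude decision of each TA can be pictured as a simple state counter. The sketch below shows a generic two-action Tsetlin Automaton with reward/penalty updates; the Type I/II feedback rules that decide when rewards and penalties are handed out are specific to the TM and omitted here (see [9]).

```python
class TsetlinAutomaton:
    """Minimal two-action Tsetlin Automaton (illustrative sketch).

    States 1..n select the action 'exclude'; states n+1..2n select 'include'.
    A reward pushes the state deeper into the current action's half,
    while a penalty pushes it towards the opposite action.
    """

    def __init__(self, n=100):
        self.n = n
        self.state = n  # start at the boundary, on the 'exclude' side

    def action(self):
        return "include" if self.state > self.n else "exclude"

    def reward(self):
        if self.action() == "include":
            self.state = min(self.state + 1, 2 * self.n)
        else:
            self.state = max(self.state - 1, 1)

    def penalize(self):
        # Moving across the boundary flips the automaton's decision.
        self.state += -1 if self.action() == "include" else 1


# Each clause keeps two such automata per feature x_k: one deciding whether
# to include the literal x_k, the other whether to include its negation.
ta = TsetlinAutomaton()
ta.penalize()
print(ta.action())  # -> 'include' (the penalty pushed it across the boundary)
```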

Learning in the TM is governed by three hyperparameters: the number of clauses $m$, specificity $s$, and voting target $T$, all set by the user [9]. The number of clauses $m$ decides the overall capacity of the TM to represent patterns, with each clause capturing a particular facet of the data. Specificity $s$, in turn, is used by the TM to control the granularity of the clauses, playing a similar role to so-called support in frequent itemset mining. Finally, the voting target $T$ produces an ensemble effect by stimulating up to $T$ clauses to output 1 for each input, but not more than $T$. This drives the clauses to distribute themselves uniformly across the patterns present in the data, avoiding local optima. In this paper, we divide $T$ by the number of clauses $m$ to obtain a target value relative to the number of clauses.
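
To illustrate, the thresholds reported later in this paper (e.g., in Table 1) are such relative values; a small, hypothetical helper for recovering the absolute target is sketched below.

```python
def absolute_voting_target(m, target_rel):
    """Convert a relative voting target (a fraction of the clause count)
    into the absolute target T used during learning.
    Illustrative helper, not part of the original TM formulation."""
    return max(1, round(target_rel * m))

print(absolute_voting_target(100, 0.04))   # -> 4 clauses as the voting target
```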

2 A Tsetlin Machine with multigranular clauses

We now introduce the Multigranular Tsetlin Machine (MTM) with the goal of eliminating specificity $s$ as a hyperparameter. Specificity controls how fine-grained the patterns sought by the TM are, and it is thus crucial to set this parameter correctly to maximize the accuracy of the resulting classifier. A poor choice for $s$ can easily result in inferior accuracy.

While $s$ is a global hyperparameter for the TM, to be set by the user, the MTM instead assigns a unique $s$-value local to each clause $C_j$, $1 \le j \le m$. In all brevity, we define a fixed range $[s_{\min}, s_{\max}]$ and then assign each clause a specificity $s_j$ decided by its index $j$:

$$s_j = s_{\max} - \frac{j-1}{m-1}\,(s_{\max} - s_{\min}).$$

As seen, the specificity values decrease linearly with the clause index $j$. In this paper, we use a single fixed range covering both coarse and very fine patterns, as this range performs robustly across all of our experiments.
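
A minimal sketch of one way to compute such a linearly decreasing specificity schedule is given below; the endpoint values are placeholders, not the exact range used in our experiments.

```python
def multigranular_specificity(m, s_max=10.0, s_min=1.0):
    """Assign each clause j = 1..m its own specificity s_j,
    decreasing linearly from s_max (clause 1) to s_min (clause m).

    The endpoint values here are placeholders, not the paper's exact range.
    """
    if m == 1:
        return [s_max]
    step = (s_max - s_min) / (m - 1)
    return [s_max - (j - 1) * step for j in range(1, m + 1)]

print(multigranular_specificity(5))   # -> [10.0, 7.75, 5.5, 3.25, 1.0]
```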

The above multigranular approach has two crucial effects. First, one avoids the need for finding a suitable value for $s$. Experimenting with different $s$-values can be computationally expensive, in particular for large datasets. Second, patterns of diverse frequencies can more easily be captured by the clauses when the clauses themselves reflect the diversity of the patterns. Indeed, the classic TM may in the worst case spend an unnecessarily large number of clauses to capture frequent patterns when $s$ has been set to also capture less frequent ones. This in turn may clutter some clauses with unnecessary literals, making them less readable (of course, these unnecessary literals may also be pruned in a post-processing phase, but at a higher computational cost during learning). As an example, assume the classic TM tries to capture a pattern of frequency $1/4$, say $x_1 \land x_2$, with an $s$-value suited to patterns of frequency $1/16$. In this case, the TM will potentially add two extra literals to the target pattern, introducing e.g. $x_1 \land x_2 \land x_3 \land \lnot x_4$. Now, to capture the pattern $x_1 \land x_2$, the TM must spend four clauses instead of one, that is, one clause per value configuration of $x_3$ and $x_4$.
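
The clause-count argument can be made explicit with a small calculation, under the simplifying assumption of roughly balanced, independent features so that each added literal halves a pattern's frequency:

```latex
\underbrace{\tfrac{1}{4}}_{\text{frequency of } x_1 \land x_2}
\;\xrightarrow{\;+\,x_3,\;\lnot x_4\;}\;
\tfrac{1}{4}\cdot\tfrac{1}{2}\cdot\tfrac{1}{2}=\tfrac{1}{16},
\qquad
\text{clauses needed to cover } x_1 \land x_2:\; 2^{2}=4.
```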

3 Experimental results

In this section, we present experimental results examining how multigranular clauses affect accuracy and learning speed, in comparison with the classic TM. For the classic TM, we used a grid search to find the best $s$-value as well as the best threshold for each configuration. For the MTM, however, we only needed to find an appropriate threshold value, with the per-clause $s_j$-values forming a fixed arithmetic progression from $s_{\max}$ down to $s_{\min}$, as described in Section 2.
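
The difference in search effort can be summarized with a sketch of the two tuning loops; `train_and_evaluate`, the grids, and the returned scores are all hypothetical placeholders, the point being only that the TM search is two-dimensional while the MTM search is one-dimensional.

```python
import itertools
import random

def train_and_evaluate(clauses, threshold, s=None):
    """Hypothetical stand-in: train a (M)TM and return test accuracy.
    Here it returns a random number so the sketch runs end to end."""
    return random.random()

s_grid = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0]   # illustrative specificity grid
t_grid = [0.01, 0.02, 0.04, 0.08, 0.16]             # illustrative relative voting targets

# Classic TM: two-dimensional grid search over (s, T).
best_s, best_t = max(itertools.product(s_grid, t_grid),
                     key=lambda st: train_and_evaluate(100, st[1], s=st[0]))

# MTM: one-dimensional search over T only; each clause's s_j is fixed by its index.
best_t_mtm = max(t_grid, key=lambda t: train_and_evaluate(100, t))
```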

In our first experiment, we consider a problem that intermixes two kinds of patterns of different complexity. In brief, we specify the patterns over a small set of binary variables. The patterns for output $y = 1$ are either a simple, frequent conjunction or a more elaborate, less frequent one, and likewise for output $y = 0$.

Clauses   s (TM)   Threshold (TM)   TM (200 ep.)   TM (500 ep.)   Threshold (MTM)   MTM (200 ep.)   MTM (500 ep.)
10        110      0.1              75.7%          78.2%          0.16              76.1%           78.0%
20        100      0.06             76.6%          78.2%          0.08              78.8%           78.4%
50        50       0.04             88.4%          89.2%          0.04              88.5%           88.2%
100       60       0.03             94.3%          95.9%          0.02              93.2%           95.2%
500       35       0.01             97.8%          98.0%          0.01              98.0%           98.0%

Table 1: Accuracy after 200 and 500 epochs for the TM and the MTM on artificial data (thresholds are given relative to the number of clauses).

Both the training and test sets consist of 300 randomly generated examples, and approximately 25% of the examples fall within each of the four patterns. Table 1 shows the final accuracy of the TM and the MTM after 200 and 500 epochs, averaged over independent experiment runs, alongside the hyperparameter values that led to that result. As seen, both algorithms exhibit similar performance for different numbers of clauses; however, the MTM did not require tuning of $s$.

Figure 1: The Tsetlin machine’s performance with 100 clauses after 500 epochs
Figure 2: The Tsetlin machine’s performance with 500 clauses after 500 epochs

Figs. 1 and 2 depict accuracy as a function of the specificity $s$ and the threshold parameter. As seen, finding high-performing hyperparameter values is not trivial, with the search space varying with the number of clauses employed. In contrast, the MTM is optimized only with respect to the threshold.

In our second experiment, we evaluate performance on the Iris flower dataset (https://archive.ics.uci.edu/ml/datasets/iris) [5]. This dataset contains measurements for three classes of flowers, with 50 instances of each. Each instance consists of four real-valued features. We used five bits to represent each real number (three bits for the integer part and two bits for the fractional part). We further employed random 80%-20% training-test splits to increase the robustness of the evaluation. The results reported in Table 2 are the average performance over independent experiment runs, each with its own training-test split. Figs. 3 and 4 capture the difficulty of finding suitable values for the hyperparameters, while Table 2 shows that the MTM attains slightly lower accuracy than the classic TM while only requiring fine-tuning of the threshold value.
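
A minimal sketch of the fixed-point binarization described above (three integer bits and two fractional bits per feature); the helper name, rounding, and clipping behaviour are our own illustrative assumptions rather than the paper's exact preprocessing.

```python
def binarize_feature(value, int_bits=3, frac_bits=2):
    """Encode a non-negative real value as int_bits + frac_bits binary digits.

    The value is rounded to the nearest 1/2**frac_bits and clipped to the
    largest representable number (7.75 for 3 + 2 bits).
    """
    scale = 2 ** frac_bits
    max_code = 2 ** (int_bits + frac_bits) - 1
    code = min(max_code, max(0, round(value * scale)))
    return [(code >> i) & 1 for i in reversed(range(int_bits + frac_bits))]

# Example: an Iris sepal length of 5.1 cm becomes a 5-bit pattern.
print(binarize_feature(5.1))   # -> [1, 0, 1, 0, 0]  (101.00 = 5.0 in fixed point)
```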

Epoch   TM (100 clauses)   TM (500 clauses)   MTM (100 clauses)   MTM (500 clauses)
100     95.1%              95.5%              94.2%               95.0%
200     95.3%              95.6%              94.5%               94.6%
300     95.1%              95.7%              94.5%               94.9%
500     95.2%              95.7%              94.7%               95.0%

Table 2: The accuracy of TM and MTM on the binary Iris dataset
Figure 3: The Tsetlin machine’s performance with 100 clauses after 500 epochs
Figure 4: The Tsetlin machine’s performance with 500 clauses after 500 epochs

Further experiments can be found in the unabridged version of this paper [8].

4 Conclusion

In this work, we introduced the Multigranular Tsetlin Machine (MTM) to reduce the complexity of the hyperparameter search in Tsetlin Machine (TM) based learning. We achieved this by eliminating the specificity hyperparameter $s$, instead introducing clauses with unique and diverse local $s$-values. Our empirical results show that we can obtain accuracy similar to that of a finely optimized classic TM, while eliminating the need to consider $s$. Furthermore, we explored the capability of the MTM to capture patterns of diverse frequencies using an artificial dataset.

As further research, a natural next step is to investigate the theoretical aspects of the MTM. Although the theoretical convergence results for the TM should also hold for the MTM, this needs to be established more rigorously. Other interesting directions include mechanisms for improving convergence speed. Finally, we intend to investigate the possibility of eliminating the two remaining hyperparameters as well, making the TM completely parameter-free.

References

  • [1] Afshar, S., Mosleh, M., Kheyrandish, M.: Presenting a new multiclass classifier based on learning automata. Neurocomputing 104, 97–104 (2013)
  • [2] Aghaebrahimi, M., Zahiri, S., Amiri, M.: Data mining using learning automata. World Acad. Sci. Eng. Technol 49, 343–351 (2009)
  • [3] Ahangaran, M., Taghizadeh, N., Beigy, H.: Associative cellular learning automata and its applications. Applied Soft Computing 53, 1–18 (2017)
  • [4] Barto, A.G., Anandan, P.: Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics (3), 360–375 (1985)
  • [5] Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
  • [6] Esmaeilpour, M., Naderifar, V., Shukur, Z.: Cellular learning automata approach for data classification. International Journal of Innovative Computing, Information and Control 8(12), 8063–8076 (2012)
  • [7] Goodwin, M., Yazidi, A., Jonassen, T.M.: Distributed learning automata for solving a classification task. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 3999–4006. IEEE (2016)
  • [8] Gorji, S.R., Granmo, O.C., Phoulady, A., Goodwin, M.: A Tsetlin Machine with Multigranular Clauses and its Applications. Unabridged journal version of this paper. To be submitted. (2019)
  • [9] Granmo, O.C.: The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv preprint arXiv:1804.01508 (2018)
  • [10] Narendra, K., Thathachar, M.: Learning Automata: An Introduction. Prentice-Hall International (1989), https://books.google.no/books?id=ljphQgAACAAJ
  • [11] Rahnamazadeh, A., Meybodi, M.R., Kadkhoda, M.T.: Node classification in social network by distributed learning automata. Information Systems & Telecommunication p. 111 (2017)
  • [12] Sastry, P., Nagendra, G., Manwani, N.: A team of continuous-action learning automata for noise-tolerant learning of half-spaces. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 40(1), 19–28 (2009)
  • [13] Sastry, P., Thathachar, M.: Learning automata algorithms for pattern classification. Sadhana 24(4-5), 261–292 (1999)
  • [14] Thathachar, M.A., Sastry, P.S.: Learning optimal discriminant functions through a cooperative game of automata. IEEE Transactions on Systems, Man, and Cybernetics 17(1), 73–85 (1987)
  • [15] Tsetlin, M.L.: On behaviour of finite automata in random medium. Avtom I Telemekhanika 22(10), 1345–1354 (1961)
  • [16] Uzun, A.O., Usta, T., Dündar, E.B., Korkmaz, E.E.: A solution to the classification problem with cellular automata. Pattern Recognition Letters 116, 114–120 (2018)
  • [17] Zahiri, S.H.: Learning automata based classifier. Pattern Recognition Letters 29(1), 40–48 (2008)
  • [18] Zahiri, S.H.: Classification rule discovery using learning automata. International Journal of Machine Learning and Cybernetics 3(3), 205–213 (2012)