1 Introduction
Structure learning deals with problems where the output is a structured object rather than a scalar-valued label bakir2007predicting . Structures studied include graphs DBLP:conf/icml/ChenSYU15 , sequences DBLP:conf/nips/LiuT15 , trees DBLP:conf/sigir/WangMC09 and vectors DBLP:journals/jmlr/LiuTM17 ; DBLP:conf/aaai/0001LTSO18 . Algorithms such as the structured perceptron DBLP:conf/emnlp/Collins02 and the structured SVM DBLP:journals/jmlr/TsochantaridisJHA05 have also been proposed. Over the last decades, structure learning has been successfully applied to object tracking DBLP:journals/pami/HareGSVCHT16 and localization DBLP:conf/eccv/BlaschkoL08 , semantic parsing DBLP:conf/emnlp/PoonD09 , drug design lavecchia2015machine and web search wang2018optimizing . Other machine learning problems also involve structures in the output space, for example multi-label learning DBLP:journals/tkde/ZhangZ14 , label ranking DBLP:journals/ai/HullermeierFCB08 ; DBLP:books/daglib/p/VembuG10 , and clustering DBLP:conf/nips/LiuST17 ; DBLP:conf/aaai/ShenLTSS17 . Such problems are closely related to structure learning.
Generally, structure learning uses structures as supervision information, and the corresponding algorithms aim at good predictive performance. However, as learning algorithms become more and more complex, interpretability, i.e., understanding the inner mechanism or what takes place during the learning process, is also becoming important DBLP:journals/corr/Lipton16a . Thus, besides using structures as supervision information, can we learn a structure from an existing model to increase the interpretability of that model? In this paper, we focus on deep learning models and try to learn structures from them to improve their interpretability. Understanding deep learning models has attracted great attention in recent years DBLP:journals/corr/KarpathyJL15 ; DBLP:journals/corr/YosinskiCNFL15 ; DBLP:conf/aaai/WuHPZ0D18 . It would therefore be very beneficial if we could learn an interpretable structure from deep learning models.
Finding an interpretable structure for a deep learning model is generally difficult. However, for a specific type of deep learning model, i.e., Recurrent Neural Networks (RNNs) DBLP:journals/tnn/GoudreauGCC94 , there is a way. As a main member of deep neural networks, RNNs, especially those with gates (gated RNNs, such as MGU with one gate DBLP:journals/corr/ZhouWZZ16 , GRU with two gates DBLP:conf/emnlp/ChoMGBBSB14 and LSTM with three gates DBLP:journals/neco/HochreiterS97 ), have been successfully applied to various tasks on sequential data, such as speech recognition DBLP:journals/spm/X12a , image captioning DBLP:conf/cvpr/VinyalsTBE15 DBLP:conf/emnlp/TangQL15 , etc. Apart from RNNs, there is another tool capable of processing sequential data, i.e., the Finite State Automaton (FSA) nla.catvn710549 ; DBLP:journals/iandc/Gold78 ; DBLP:journals/csur/AngliunS83 . An FSA is composed of a finite number of states and transitions between states. It transits from one state to another in response to external sequential inputs. The transition process of an FSA is similar to that of RNNs: both models accept the items of a sequence one by one and transit between states accordingly. Unlike RNNs, the inner mechanism of an FSA is easier to interpret, since it can be simulated by human beings DBLP:journals/corr/Lipton16a and the transitions between states have physical meanings instead of being numerical calculations as in RNNs. This characteristic leads us to consider learning an FSA from RNNs and using the natural interpretability of the FSA to understand the RNNs’ inner mechanism. Therefore we adopt FSA as the interpretable structure that we look for. Different from previous work on structure learning, where the predictions or classification results are structured, the structured output in our paper is an intermediate outcome obtained to better understand the RNNs’ inner mechanism.
In order to learn an FSA from RNNs and use the FSA to interpret the inner mechanism of RNNs, we need to answer two questions: how to learn and what to interpret. For the first question, we are inspired by the fact that hidden states of classical non-gated RNNs tend to form clusters DBLP:journals/nn/OmlinG96 ; DBLP:journals/neco/ZengGS93 . However, there are still important unsolved issues. One is that we do not know whether the tendency to form clusters also holds for gated RNNs. We also need to consider efficiency, since gated RNNs nowadays are often applied to large data sets. For the second question, we need to analyze the role of the gates in gated RNNs. In particular, since different gated RNNs have different numbers of gates, we should discuss the impact of the number of gates. Given that transitions between states in an FSA have physical meanings, we may infer the semantic meanings of RNNs’ transitions from the corresponding transitions in the FSA.
Note that in generic machine learning tasks, learning from multiple data resources DBLP:journals/tip/GongTMLKY16 ; DBLP:conf/aaai/Gong17 , or training several base models and then combining them zhou2012ensemble , usually produces better results. In structure learning, it is also beneficial to incorporate multiple models DBLP:conf/nips/GuzmanRiveraBK12 , where a set of multiple hypotheses is produced for experts to evaluate. Thus, besides learning only one FSA from RNNs, we also generate multiple FSAs and combine them in an ensemble to improve performance. Furthermore, a single structure may contain limited semantic information, whereas multiple structures can make the semantic information richer and easier to understand.
In this paper, we attempt to study RNNs through learning FSAs from them. We first verify that, besides RNNs without gates, the hidden states of gated RNNs also have a natural tendency to form clusters. Then we propose two methods. One is based on the highly efficient clustering method k-means++ DBLP:conf/soda/ArthurV07 . The other, named kmeansx, makes use of the observation that hidden states close to each other in the same sequence also tend to be near in geometrical space. We then learn an FSA by designing its five necessary elements, i.e., the alphabet, a set of states, the start state, a set of accepting states and the state transitions. We apply our methods to artificial data and real-world data. For the artificial data, we first illustrate the learned FSA, whose running process human beings can follow and understand. Then the results on the relationship between accuracy and the number of clusters suggest that gates are necessary to gated RNNs, but the fewer gates the better. This explains to some extent why MGU (with only one gate) has merits over other gated RNNs. For the real-world data on sentiment analysis, we find that behind the numerical calculations, RNNs’ hidden states indeed have the capacity to distinguish semantic differences: in the corresponding FSA, words leading to positive/negative outputs do carry positive/negative emotions understandable to human beings. For both datasets, we also produce multiple FSAs from RNNs with different initializations and combine them in an ensemble. The experimental results validate that multiple FSAs can improve the performance and make the semantic information richer.
In the following, we are going to introduce background. Then we state our detailed algorithms, followed by experiments. Finally, we conclude our work.
2 Background
In this section, we introduce one non-gated RNN and three gated RNNs, which will be studied in our paper. We also discuss interpretation in this section.
First, we introduce the classical non-gated RNN. It was proposed in the early 90s DBLP:journals/cogsci/Elman90 with a simple structure that does not possess any gate, and is only applied to small-scale data. So we call it simple RNN (SRN). In general, SRN takes each element of a sequence as an input and combines it with the hidden state from the previous time step to calculate the current hidden state, step by step. Concretely, at time $t$, we input the $t$-th element of a sequence, say $x_t$, into the hidden unit. Then the hidden unit gives the output $h_t$ based on the current input $x_t$ and the previous hidden state $h_{t-1}$ in the following way:
$$h_t = f(h_{t-1}, x_t).$$
$f$ is usually defined as a linear transformation plus a nonlinear activation, e.g.,
$$h_t = \tanh(W[h_{t-1}, x_t] + b),$$
where the matrix $W$ consists of parameters related to $h_{t-1}$ and $x_t$, and $b$ is a bias term. The task of SRN is to learn the parameters $W$ and $b$.
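The recurrence can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it uses a tanh activation, a zero initial hidden state, and hypothetical toy weights; the names `srn_step` and `srn_hidden_states` are not from the paper.

```python
import math

def srn_step(W, b, h_prev, x_t):
    """One SRN update: tanh of a linear map applied to [h_prev, x_t] plus a bias."""
    z = h_prev + x_t  # list concatenation of previous hidden state and input
    return [math.tanh(sum(w * v for w, v in zip(row, z)) + b_i)
            for row, b_i in zip(W, b)]

def srn_hidden_states(W, b, sequence, hidden_dim):
    """Run the unit over a whole sequence and collect every hidden state."""
    h = [0.0] * hidden_dim  # assumed zero initial state
    states = []
    for x_t in sequence:
        h = srn_step(W, b, h, x_t)
        states.append(h)
    return states

# Toy example: 2 hidden units, 1-dimensional inputs (hypothetical weights).
W = [[0.5, -0.3, 1.0],
     [0.1, 0.8, -1.0]]
b = [0.0, 0.1]
states = srn_hidden_states(W, b, [[0.0], [1.0], [1.0], [0.0]], hidden_dim=2)
```

Each entry of `states` is one hidden state point; these are the points that are later clustered into FSA states.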
Table 1: Summary of the three gated RNNs: MGU (minimal gated unit), GRU (gated recurrent unit) and LSTM (long short-term memory).
However, the data we face keep growing, so deeper models DBLP:conf/nips/KrizhevskySH12 ; DBLP:books/daglib/0040158 are needed to tackle them. In this situation, SRN suffers from the vanishing or exploding gradient issue, which makes learning SRN by gradient descent very difficult DBLP:journals/neco/HochreiterS97 ; DBLP:journals/tnn/BengioSF94 . Fortunately, gated RNNs were proposed to solve the gradient issue by introducing various gates into the hidden unit to control how information flows in the RNN. The two prevailing gated RNNs are Long Short-Term Memory (LSTM) DBLP:journals/neco/GersSC00 and Gated Recurrent Unit (GRU) DBLP:conf/emnlp/ChoMGBBSB14 . LSTM has three gates: an input gate controlling the adding of new information, a forget gate determining the remembering of old information, and an output gate deciding the outputting of current information. GRU has two gates, an update gate and a reset gate, which control the forgetting of old information and the adding of new information, respectively, similar to the forget and input gates in LSTM.
The previous models add several gates to one hidden unit, producing many additional parameters to tune and compute, and thus may not be efficient enough. To tackle this, DBLP:journals/corr/ZhouWZZ16 proposed the Minimal Gated Unit (MGU), which has only a forget gate and performs comparably to LSTM and GRU. MGU therefore has a simpler structure and fewer parameters, and is faster to train and tune than the gated RNNs mentioned above.
The mathematical formalizations of the three gated RNN models mentioned above, MGU, GRU and LSTM, are summarized in Table 1, in which
$$\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (1)$$
is the logistic sigmoid function (applied to every component of the vector input) and $\odot$ is the componentwise product between two vectors. All gates in Table 1 are marked with the text “(gate)”, from which we can easily see that MGU has one gate, GRU has two gates and LSTM has three gates.
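For reference, the three gated units are commonly written as below. This is a sketch of the standard formulations from the cited MGU, GRU and LSTM papers, assuming the notation $[h_{t-1}, x_t]$ for stacking the previous hidden state and the current input. The symbol names ($f_t$, $z_t$, $r_t$, $i_t$, $o_t$, $c_t$, $\tilde{h}_t$, $\tilde{c}_t$) are assumptions and may differ in detail from Table 1.

```latex
\begin{align*}
\text{MGU:}\quad
  & f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(gate)} \\
  & \tilde{h}_t = \tanh(W_h [f_t \odot h_{t-1}, x_t] + b_h) \\
  & h_t = (1 - f_t) \odot h_{t-1} + f_t \odot \tilde{h}_t \\[4pt]
\text{GRU:}\quad
  & z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) && \text{(gate)} \\
  & r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) && \text{(gate)} \\
  & \tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \\
  & h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\[4pt]
\text{LSTM:}\quad
  & f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(gate)} \\
  & i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(gate)} \\
  & o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(gate)} \\
  & \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \\
  & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
  & h_t = o_t \odot \tanh(c_t)
\end{align*}
```

Counting the lines marked “(gate)” gives one gate for MGU, two for GRU and three for LSTM.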
Note that although different gated RNN models with various gates added to the hidden unit have been proposed, they are still difficult to interpret because of their complex inner mechanism. Three main factors cause this complexity. The first is the recurrent structure inherited from the classical RNN DBLP:journals/tnn/GoudreauGCC94 . Although the recurrent structure has been shown to be the key to handling sequential data, using the same unit recurrently for different inputs makes the inner process of classification confusing for human beings. The second is the gates used on the unit. Although one of the reasons why MGU is appreciated is that it uses far fewer gates than other models DBLP:journals/corr/ZhouWZZ16 , e.g., LSTM or GRU, the function of gates has not been fully understood, especially how many gates are inherently required for a gated RNN model. Thirdly, the inner process of gated RNNs takes the form of numerical calculation, and a numerical vector cannot be directly associated with a concrete meaning that people can understand. In short, gated RNNs’ inner mechanism is too complex for human beings to follow and understand.
In this paper, we will learn the interpretable structure, i.e., FSA to probe into the gated RNNs and attempt to make contributions on the interpretation. We will find that MGU with minimal number of gates still outperforms other RNNs from the FSA’s perspective. This may raise a new direction to design better RNN models.
3 Our Approach
In this section, we first introduce the intuition and framework, followed by the details of the proposed method including clustering hidden states and learning FSA.
3.1 Intuition and Framework
We consider the following case. First we train an RNN model $\mathcal{R}$ on training data. Then two test sequences $a$ and $b$ are input to $\mathcal{R}$ separately. It is reasonable to expect that if the two subsequences input to $\mathcal{R}$ before time $t_1$ of $a$ and time $t_2$ of $b$ are analogous, the hidden states at time step $t_1$ of $a$ and $t_2$ of $b$ will also resemble each other. We regard a hidden state as a vector or a point. Thus when several sequences are input to the RNN, a large number of hidden state points accumulate, and they tend to form clusters. To validate that, we show the distribution of hidden state points when testing from MGU, SRN, GRU and LSTM respectively in Figure 1 (a) to (d). We set the original dimension of hidden states to 10. Then we use t-Distributed Stochastic Neighbor Embedding (t-SNE) DBLP:journals/jmlr/Maaten09 to reduce the dimension of all 400 hidden state points from 10 to 2 so that we can plot them on the plane. As can be seen, the hidden state points obtained from all the different RNNs tend to form clusters. We assume different clusters represent different states, and transitions between states arise when one item of the input sequence is read in. Hence the network behaves like a state automaton. We assume the states are finite, so we can learn a Finite State Automaton (FSA) from a trained RNN model.
The overall framework is shown in Figure 2. We first train RNNs on training data, then cluster all hidden states corresponding to the validation data $V$, and finally learn an FSA with respect to $V$. After obtaining an FSA, we can use it to test unlabeled data or directly draw an illustration. In the first step of training RNNs, we exploit the same strategy as DBLP:journals/corr/ZhouWZZ16 and omit the details here. In the following, we elaborate on the hidden state clustering and FSA learning steps.
3.2 Hidden States Clustering
The first clustering method we consider is k-means hartigan1979algorithm . K-means minimizes the average squared Euclidean distance of points from their cluster centers; it is efficient, effective and widely used. To obtain a robust result, we use a variant of k-means named k-means++ DBLP:conf/soda/ArthurV07 , which uses “$D^2$ weighting” to select the initial cluster centers.
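To make the seeding step concrete, the following is a minimal pure-Python sketch of k-means with k-means++ initialization. It is an illustration only, not the implementation used in the experiments; the function names and the toy points are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng):
    """Pick k initial centers; each new center is sampled with probability
    proportional to its squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # all remaining points coincide with a center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, k, iters=50, seed=0):
    """Lloyd iterations started from k-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeanspp_seeds(points, k, rng)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Two well-separated toy groups of 2-dimensional "hidden state points".
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assign = kmeans(points, k=2)
```

In the setting of this paper, `points` would be the hidden state points collected from the validation data, and `assign` would give the FSA state of each point.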
Nevertheless, directly using Euclidean distance may not be appropriate. Besides, it is reasonable to assume that the hidden state points in the same sequence are more similar, and the hidden state points that are close in time are also near in space. Thus, to consider this characteristic, we concatenate the original hidden state points with extra features which reflect the time closeness. We present an illustration as follows:
where means the th dimension of hidden state point . The dimension of the extra feature is the number of sequences in . Note that each element of a sequence corresponds to a hidden state. For the th element in the th sequence, the content in the th position of the extra feature is . We call the extra feature “extra position feature”. After altering the space, we still use kmeans++ to do clustering on the new space. We call this cluster method “kmeansx”.
3.3 Learning FSA
An FSA is a 5-tuple $\mathcal{M} = \langle \Sigma, S, s_0, F, \delta \rangle$, where $\Sigma$ is the alphabet, meaning the set of the elements appearing in the input sequences, $S$ is a set of states, $s_0 \in S$ is the start state, $F \subseteq S$ is a set of accepting states and $\delta$ defines the state transitions in $\mathcal{M}$. In order to learn an FSA, we specify below how to design these five elements.
In our case, we want to learn FSAs (Finite State Automata) from gated RNNs. The alphabet $\Sigma$ is easy to obtain from the data. For example, if the data are sentences consisting of words, then $\Sigma$ is the set of all words in all sentences. So we have
$$\Sigma = \operatorname{vocabulary}(V) \qquad (2)$$
where $\operatorname{vocabulary}(V)$ means the vocabulary of the validation data $V$.
Every time we input an element $x_t$ from some sequence into the RNN, we get the current hidden state $h_t$ given the previous hidden state $h_{t-1}$. This process is similar to inputting a symbol from the alphabet $\Sigma$ into an FSA: according to the current state and the state transition function $\delta$, we know which state to transit to. Thus we can regard a cluster consisting of several similar hidden state points as a state in the FSA. Then, the set of states is
$$S = \{C \mid h \in C\} \cup \{s_0\} \qquad (3)$$
where $C$ is a cluster of a number of hidden state points $h$.
We define the start state $s_0$ as an empty state without any hidden state point, because when we input the first word into the RNN, no previous hidden state is given. Thus the start state is just a starting symbol. The accepting states $F$ can be determined by the cluster centers. Note that each state in the FSA is a cluster of hidden state points in the RNN. We use the RNN’s classifier $\mathcal{R}$ to classify the cluster center of each state. If the classification result is positive, then the corresponding state is an accepting state, namely,
$$F = \{C \mid \mathcal{R}(\text{center of } C) = 1\} \qquad (4)$$
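The rule for accepting states can be sketched as follows. This is an illustration under assumptions: `classify` is a hypothetical stand-in for the RNN's output classifier, every cluster is assumed to be non-empty, and the toy hidden states and assignments are made up.

```python
def cluster_centers(hidden_states, assign, k):
    """Mean of the hidden state points assigned to each cluster
    (assumes that no cluster is empty)."""
    centers = []
    for j in range(k):
        members = [h for h, a in zip(hidden_states, assign) if a == j]
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers

def accepting_states(centers, classify):
    """Indices of clusters whose center is classified as positive."""
    return {j for j, c in enumerate(centers) if classify(c) == 1}

def classify(h):
    """Hypothetical stand-in for the RNN's classifier on a hidden state."""
    return 1 if h[0] > 0 else 0

hidden_states = [[-0.9, 0.1], [-0.7, 0.3], [0.8, 0.2], [0.6, -0.4]]
assign = [0, 0, 1, 1]  # cluster index of each hidden state point
accepting = accepting_states(cluster_centers(hidden_states, assign, 2), classify)
```

Here only cluster 1 has a center on the positive side of the stand-in classifier, so it is the single accepting state.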
The fifth element $\delta$ is the most difficult to obtain among the five elements. We use a transition matrix $T$ with $|S|$ rows and $|\Sigma|$ columns to represent the state transitions $\delta$, where $|S|$ is the number of elements in $S$ and $|\Sigma|$ is the number of symbols in $\Sigma$; every entry of $T$ is the index of a state. In $T$, each row represents one state (the first row represents the start state $s_0$) and each column represents a symbol in the alphabet. $T(i, s) = j$ means that state $i$ transits to state $j$ when a symbol $s$ is input whose corresponding hidden state point belongs to the $j$-th state. To get the transition matrix $T$, we first calculate a matrix $N_s$ for each symbol $s$ in the alphabet (e.g., 0 or 1 in a binary alphabet), whose $(i, j)$-th entry is the frequency of jumping from state $i$ to state $j$ given $s$ in all sequences, using the following steps:
1. index every cluster (state), associating each hidden state point with a state in the FSA;
2. iterate through all hidden state points, and increase $N_s(i, j)$ by one when $s$ incurs a transition from state $i$ to state $j$.
As a consequence, $N_s(i, j)$ represents the number of transitions from state $i$ to state $j$ when inputting $s$. In this case, when inputting $s$, state $i$ may transit to several states. We intend to obtain a deterministic FSA for clear illustration, so in each row of $N_s$ we keep only the biggest value, which means abandoning the less frequent transitions. Then the transition matrix $T$ can be quickly calculated as follows:
$$T(i, s) = \arg\max_{j} N_s(i, j) \qquad (5)$$
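The counting and the row-wise maximum described above can be sketched in a few lines. The sketch makes several assumptions that are not from the paper: the start state is given index 0, clusters are indexed from 1, transitions are stored in a dictionary keyed by (state, symbol) instead of a dense matrix, and ties are broken by whichever target state was counted first.

```python
from collections import defaultdict

START = 0  # index reserved for the start state (an assumption of this sketch)

def count_transitions(sequences, state_ids):
    """counts[symbol][i][j]: how often reading `symbol` moves state i to state j.
    state_ids[n][t] is the cluster index of the hidden state reached after
    reading the t-th symbol of the n-th sequence."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for seq, ids in zip(sequences, state_ids):
        prev = START
        for symbol, cur in zip(seq, ids):
            counts[symbol][prev][cur] += 1
            prev = cur
    return counts

def deterministic_transitions(counts):
    """Keep, for each (state, symbol), only the most frequent target state."""
    T = {}
    for symbol, rows in counts.items():
        for i, row in rows.items():
            T[(i, symbol)] = max(row, key=row.get)
    return T

# Toy data: four binary sequences and the cluster index of each hidden state.
sequences = ["01", "01", "01", "00"]
state_ids = [[1, 2], [1, 2], [1, 1], [1, 1]]
T = deterministic_transitions(count_transitions(sequences, state_ids))
```

In the toy data, reading "1" in state 1 leads to state 2 twice and to state 1 once, so only the transition to state 2 is kept.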
We can draw an illustration of the FSA according to the learned transition matrix and use the FSA to do classification. When doing classification, the FSA keeps jumping from one state to another in response to the sequentially input symbols, until the end of the sequence. If the final state is an accepting state, the sequence is predicted to be positive by the FSA.
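Classification with a deterministic FSA amounts to following one transition per symbol. The sketch below assumes transitions are stored in a dictionary keyed by (state, symbol) and that a missing transition means rejection; both choices, as well as the hand-built automaton for the sequence 0110, are illustrative assumptions and not a learned FSA from the paper.

```python
import itertools

def fsa_accepts(sequence, T, accepting, start=0):
    """Follow the transitions symbol by symbol and test the final state."""
    state = start
    for symbol in sequence:
        if (state, symbol) not in T:
            return False  # unseen transition: treated as rejection (assumption)
        state = T[(state, symbol)]
    return state in accepting

# Hand-built automaton that accepts exactly the sequence 0110.
DEAD = 5
T = {(s, c): DEAD for s in range(6) for c in "01"}
T.update({(0, "0"): 1, (1, "1"): 2, (2, "1"): 3, (3, "0"): 4})
accepting = {4}

accepted = ["".join(bits) for bits in itertools.product("01", repeat=4)
            if fsa_accepts(bits, T, accepting)]
```

Among all 16 length-4 binary sequences, only 0110 ends in the accepting state.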
The whole process of learning an FSA from an RNN is presented in Algorithm 1. We call our method LISOR (Learning with Interpretable Structure frOm Rnn) and present two concrete algorithms according to the clustering method used. The one based on k-means++ is named “LISORk”, while the one based on kmeansx is called “LISORx”. By utilizing the tendency of hidden state points to form clusters, both LISORk and LISORx can learn a well-generalized FSA from RNN models.
4 Experiments and Discussions
In this section, we conduct experiments on both artificial and real tasks and visualize the FSAs learned from the corresponding RNN models. In both tasks, we also discuss how to interpret the RNN models from the FSAs and report the accuracy obtained when the learned FSAs are used for classification.
4.1 Artificial Tasks
In this section, we explore two artificial tasks. The goal of the experiments is to draw a visualized illustration of the learned FSAs and show how to interpret RNNs from the learned FSAs.
4.1.1 Settings
The first artificial task is to identify the sequence 0110 in a set of length-4 sequences that contain only 0 and 1 (task “0110”). This is a simple task containing 16 distinct cases. We include 1000 instances in the training set, with duplicated instances to improve accuracy. We use a validation set containing all possible length-4 zero-one sequences without duplication to learn FSAs, and randomly generate 100 instances for testing.
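A data set of this shape can be generated with a few lines. The sketch below assumes uniform sampling with replacement for the training and testing sets; the paper does not state how duplicates were drawn, so this sampling scheme and all names are assumptions.

```python
import itertools
import random

def all_binary_sequences(length):
    """All zero-one sequences of the given length, as strings."""
    return ["".join(bits) for bits in itertools.product("01", repeat=length)]

def label_0110(seq):
    """1 for the target sequence 0110, 0 otherwise."""
    return 1 if seq == "0110" else 0

rng = random.Random(0)
cases = all_binary_sequences(4)

# Validation: every distinct case exactly once.
validation = [(s, label_0110(s)) for s in cases]
# Training and testing: sampled with replacement (assumed sampling scheme).
training = [(s, label_0110(s)) for s in rng.choices(cases, k=1000)]
testing = [(s, label_0110(s)) for s in rng.choices(cases, k=100)]
```

With 16 sequences of 4 symbols each, the validation set yields 64 hidden state points for clustering.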
The second task is to determine whether a sequence contains three consecutive zeros (task “000”). There is no limit on the length of the sequences, so the task has an infinite instance space and is more difficult than task “0110”. We randomly generate 3000 zero-one training instances whose lengths are also randomly decided. We also generate 500 validation and 500 testing instances.
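A generator for this task might look as follows. The maximum length of 20 and the uniform choice of length and symbols are assumptions made only for illustration; the paper does not specify the length distribution.

```python
import random

def label_000(seq):
    """1 if the sequence contains three consecutive zeros, 0 otherwise."""
    return 1 if "000" in seq else 0

def random_sequence(rng, max_len):
    """Zero-one string with a random length between 1 and max_len."""
    length = rng.randint(1, max_len)  # bounds are inclusive
    return "".join(rng.choice("01") for _ in range(length))

def make_split(rng, size, max_len=20):
    """Labeled random sequences; max_len=20 is an assumed cap."""
    seqs = [random_sequence(rng, max_len) for _ in range(size)]
    return [(s, label_000(s)) for s in seqs]

rng = random.Random(0)
train_set = make_split(rng, 3000)
valid_set = make_split(rng, 500)
test_set = make_split(rng, 500)
```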
For both these tasks we mainly study MGU, SRN, GRU and LSTM mentioned in Section 2. For all these four RNN models, we set the dimension of hidden state and the number of hidden layers to be 10 and 3 respectively. We conduct each experiment 5 times and report the average results.
Table 2: Number of clusters at which the FSA learned from each RNN first reaches accuracy 1.0 on task “0110”.
          LISORk                       LISORx
          MGU   SRN   GRU    LSTM      MGU   SRN    GRU    LSTM
Trial 1   5     13    7      13        5     13     8      15
Trial 2   8     9     25     9         8     9      65     10
Trial 3   6     6     8      12        6     6      8      12
Trial 4   5     5     8      17        5     5      8      65
Trial 5   6     22    9      22        6     20     9      24
Average   6     11    11.2   14.6      6     10.6   19.6   25.2
Table 3: Number of clusters at which the FSA learned from each RNN first reaches accuracy 0.7 on task “000”.
          LISORk                         LISORx
          MGU    SRN    GRU     LSTM     MGU    SRN    GRU     LSTM
Trial 1   38     84     201     26       31     52     156     25
Trial 2   6      28     109     72       6      27     137     60
Trial 3   9      28     201     20       9      18     201     26
Trial 4   8      41     85      19       8      39     91      22
Trial 5   7      180    201     22       7      145    201     39
Average   13.6   72.2   159.4   31.8     12.2   56.2   157.2   34.4
4.1.2 Discussions on the Number of Clusters
According to Algorithm 1, in order to learn and visualize an FSA, we need to set the number of clusters or, equivalently, the number of states in the FSA. Note that more clusters mean that each cluster contains fewer hidden state points. In the extreme case where the number of clusters equals the number of hidden state points, the state transitions in the FSA resemble the way hidden state points transit in RNNs. So the performance of the FSA should be close to that of the RNN when the number of clusters is large enough. Nevertheless, we want the number of states in the FSA to be as small as possible, to prevent overfitting, increase efficiency and reduce complexity, so that the FSA can be easily interpreted by human beings. Achieving high accuracy with a small number of clusters is therefore a good characteristic, and we attempt to make the number of clusters as small as possible while guaranteeing classification performance.
In the task “0110”, we vary the number of clusters from 2 to 64 (we accumulate 64 hidden state points since we only have 16 sequences in the validation data and each sequence contains 4 numbers). Table 2 gives the number of clusters required for the accuracy of the FSAs learned from the four RNNs to first reach 1.0, which means perfectly identifying all 0110 sequences. We can see that among all four RNN models, the FSA learned from MGU always achieves accuracy 1.0 with the smallest number of clusters, in each trial and on average. Specifically, on average, for LISORk the FSA learned from MGU first achieves accuracy 1.0 when the number of clusters is 6, followed by that of SRN at 11 clusters. The third is the FSA learned from GRU with 11.2 clusters, and the last is that of LSTM with 14.6. For LISORx, the corresponding numbers of clusters are 6, 10.6, 19.6 and 25.2, respectively. We can see that the clustering method kmeansx does not bring much benefit on this simple task compared to k-means++. It reduces the number of clusters of the FSA learned from SRN but increases those of the FSAs learned from GRU and LSTM. A possible explanation is that, because the task is simple, k-means++ already performs well enough and leaves kmeansx no room for improvement.
In the task “000”, we vary the number of clusters from 2 to 200. We actually have $500 \times l$ hidden state points, where $l$ is the average length of all the sequences, but we do not need that many clusters since, as in task “0110”, a large number of clusters may not improve performance much and may make interpretation from the FSA more difficult. This task is more complicated than task “0110”, and neither the original RNN models nor the learned FSAs can reach accuracy 1.0 as they do on task “0110”. Thus we focus on accuracy over 0.7, i.e., we increase the number of clusters until the accuracy of the learned FSA reaches 0.7. As can be seen from Table 3, on average for LISORk, the FSA learned from MGU first achieves accuracy over 0.7 when there are 13.6 clusters. The FSA learned from LSTM then achieves this goal with 31.8 clusters, followed by that of SRN at 72.2 clusters. The last is the FSA learned from GRU, which achieves 0.7 when the number of clusters is 159.4. For LISORx, the corresponding numbers of clusters for the FSAs learned from MGU, SRN, GRU and LSTM are 12.2, 56.2, 157.2 and 34.4, respectively. We can see that the clustering method kmeansx plays a role in this task: it lowers the number of clusters for MGU, SRN and GRU.
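The search over the number of clusters described above can be sketched as a simple loop. The function name and the toy accuracy curve are hypothetical; in practice `accuracy_of(k)` would cluster the hidden states into `k` clusters, learn the FSA and evaluate it.

```python
def first_k_reaching(accuracy_of, ks, target):
    """Smallest number of clusters in `ks` whose accuracy reaches `target`,
    or None if no candidate does."""
    for k in ks:
        if accuracy_of(k) >= target:
            return k
    return None

# Made-up accuracy curve, for illustration only.
toy_curve = {2: 0.50, 3: 0.62, 4: 0.71, 5: 0.69}
k_07 = first_k_reaching(toy_curve.get, range(2, 6), target=0.7)
k_09 = first_k_reaching(toy_curve.get, range(2, 6), target=0.9)
```

Because the loop stops at the first success, a later drop in accuracy (as at 5 clusters in the toy curve) does not affect the reported number.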
4.1.3 Graphical Illustration of FSA
In order to visualize the corresponding FSA for each RNN model, we take our first method LISORk as an example and choose the number of clusters closest to the average number. In task “0110”, for LISORk, the average numbers of clusters at which accuracy 1.0 is first achieved for MGU, SRN, GRU and LSTM are 6, 11, 11.2 and 14.6. Thus we set the number of clusters for MGU to 6 (from trial 3), for SRN to 9 (from trial 2), for GRU to 9 (from trial 5) and for LSTM to 13 (from trial 1).
We then illustrate the FSAs’ structures in Figure 3, drawn by Graphviz DBLP:books/sp/04/EllsonGKNW04 , to give a visual impression of the proposed LISOR’s output. Here we use a gray circle and a double circle to represent start and accepting states, respectively. We mark the paths of the 0110 sequence in red. As can be seen, among all length-4 zero-one sequences, only 0110 eventually leads to an accepting state by following the transitions in the illustrated FSAs; the other sequences cannot reach the accepting state. We want to emphasize that, when following the flow of an FSA, transitions between states are caused directly by the input word. We need not do any numerical calculation as we do in RNN models, which makes the whole process easier to understand.
We also illustrate the FSAs’ structures for task “000” in Figure 4. As in task “0110”, we take LISORk as an example and choose the number of clusters close to the average number. For LISORk, the average numbers of clusters at which accuracy 0.7 is first achieved for MGU, SRN, GRU and LSTM are 13.6, 72.2, 159.4 and 31.8. Thus we set the number of clusters for MGU to 6 (from trial 2), for SRN to 28 (from trial 2), for GRU to 85 (from trial 4) and for LSTM to 19 (from trial 4). We can see that the corresponding FSAs are much more complex than those of task “0110”. Due to the complexity of this task, different positive sequences reach the final accepting state in different ways, so we do not mark the transition paths of positive sequences.
4.1.4 Interpretation about Gate Effect
Section 4.1.2 gives a first impression that MGU can achieve guaranteed accuracy with a smaller number of clusters. Here we give more detailed results, i.e., how the accuracy of the learned FSA changes as the number of clusters increases.
For task “0110”, the average accuracy tendencies over five trials are shown in Figure 5a and 5b, which correspond to LISORk and LISORx, respectively. Here we limit the number of clusters to be less than 24, since the performance changes only slightly beyond 24. In Figure 5a and 5b, all FSA models reach high performance with a small number of clusters since the task is not complex. When the number of clusters increases, an FSA’s performance may be unstable due to the loss of information when we abandon less frequent transitions. We can see that the FSA learned from MGU is always the first to achieve high accuracy and holds the lead.
For task “000”, the average accuracy tendencies over five trials, up to 100 clusters, are shown in Figure 5c and 5d. As can be seen from Figure 5c and 5d, the accuracies of all four FSAs increase as the number of clusters increases. The FSA learned from MGU is again the first to achieve high accuracy and holds the lead.
In summary, we observe that the FSA learned from MGU reaches its best performance earlier than those of the other RNN models as the number of clusters increases. MGU is therefore the most efficient in the sense that its learned FSA offers a clearer illustration and easier interpretability. Inspired by this phenomenon, together with the facts that MGU contains fewer gates on the unit than GRU and LSTM and that SRN contains no gates, we tend to treat the gate as a regularizer controlling the complexity of the learned FSAs, as well as the complexity of the space of hidden state points, while having no gate at all leads to underfitting. This conclusion motivates us to design other RNN models in the future that necessarily contain gates, but only a minimal number of gates as in MGU.
4.1.5 Ensemble Results of Multiple FSAs
Generally, an ensemble of multiple classifiers improves classification performance zhou2012ensemble . In this section, we show how the ensemble accuracy changes as the number of clusters of the learned FSAs increases. We focus on MGU since its learned FSA outperforms the others in the previous experiments. We train five MGUs with different initializations of parameters. After obtaining the corresponding FSAs, we give the final output using majority voting, i.e., the output is positive only when at least 3 out of 5 FSAs vote for positive. The results are shown in Figure 6. We can see that in both tasks, the ensemble of multiple FSAs does improve classification performance. This shows that an ensemble of multiple learned structures leads to better classification in our tasks. We further observe that on the more complex “000” task, the improvement is much larger than on the easier task “0110”. We conjecture that an ensemble of multiple FSAs is more suitable for complex tasks, and we continue to use this strategy in the more complex real task.
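Majority voting over several FSAs can be sketched as follows. The five classifiers below are hypothetical stand-ins (plain functions returning 0 or 1), not FSAs learned from MGUs.

```python
def majority_vote(votes):
    """1 if strictly more than half of the 0/1 votes are positive, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0

def ensemble_predict(classifiers, sequence):
    """Let every classifier vote on the sequence and take the majority."""
    return majority_vote([clf(sequence) for clf in classifiers])

# Hypothetical stand-ins: three good classifiers and two constant ones.
good = lambda s: int("000" in s)
classifiers = [good, good, good, lambda s: 0, lambda s: 1]

pred_pos = ensemble_predict(classifiers, "1000")
pred_neg = ensemble_predict(classifiers, "1010")
```

With five voters, the strict-majority rule is equivalent to requiring at least three positive votes.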
4.2 Real Task
In this section, we conduct experiments on a practical sentiment analysis task. We mainly report the accuracy of the FSAs learned from the four RNNs on this real task. We then use the results of the best-performing model, MGU, as an example to show that the learned FSA indeed has the ability to distinguish semantics.
4.2.1 Settings
In this task, we use the IMDB dataset DBLP:conf/acl/MaasDPHNP11 to do sentiment analysis DBLP:conf/acl/PoriaCHMZM17 ; DBLP:conf/acl/MaasDPHNP11 , which is a very common task in natural language processing. In this dataset, each instance is a comment on a movie, and the task is to classify the given comment as expressing positive or negative sentiment.
To train the RNN models, we first use word2vec DBLP:journals/corr/abs13013781 to map each English word from the film reviews into a 300-dimensional numerical vector. Then we train four different RNNs (MGU, SRN, GRU and LSTM) using these vectors as input. For all RNN models, the dimension of hidden states and the number of hidden layers are set to 10 and 3 respectively, and we randomly select 2000 random-length film reviews as training data. After obtaining the trained RNNs, we learn four different FSAs using 200 testing instances. Note that we adopt a transductive setting, i.e., we use the test data directly to learn the FSAs, to ensure that all words in the test data’s vocabulary are fully covered.
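Table 4: Accuracy on the sentiment analysis task of the FSAs learned from each RNN with 2 clusters: average over five FSAs and majority-voting ensemble.
RNN Type   LISORk                LISORx
           Average   Ensemble    Average   Ensemble
MGU        0.701     0.740       0.740     0.850
SRN        0.604     0.635       0.592     0.615
GRU        0.662     0.660       0.699     0.780
LSTM       0.669     0.750       0.669     0.755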
4.2.2 Discussion on the Number of Clusters
Note that this task is more complex than the artificial tasks, so we cannot enumerate all possible numbers of clusters (i.e., numbers of hidden states in RNNs). We tried a range of cluster numbers and found that the smaller the number of clusters, the better the performance. We understand that if the number of clusters is large enough, the FSA will perform better and approach the performance of the corresponding RNN model. However, when the number of clusters is small, our empirical results show that a simpler structure may lead to better performance. Thus in this part, we only present the results for 2 clusters. In this case, all the FSAs have the simplest structure, which is easy to understand and to illustrate visually. For the same number of clusters, an FSA with higher accuracy is more practical.
4.2.3 Graphical Illustration of FSA
This task has a much larger vocabulary, containing thousands of English words, which means that the number of symbols (i.e., words) in the alphabet is no longer just 2 as in the artificial tasks. Thus, to illustrate the FSA graphically, we merge all edges in the same direction between two states into one edge, and show the resulting FSA learned from MGU with two clusters in Figure 7. The structures of the other FSAs are similar, so we omit them. In this way, the words on a merged edge are naturally grouped into a class named “word_class”. We learned five FSAs from five MGUs with different initializations. We find that their structures are the same but their accepting states differ. As can be seen from Figure 7, the accepting state of trials 1 and 5 is State 1 (S_1), while that of trials 2, 3 and 4 is State 0 (S_0).
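The edge-merging step can be sketched as below. This is an illustrative sketch, not the implementation behind Figure 7: the triple representation of transitions, the function name and the example words are assumptions.

```python
from collections import defaultdict


def merge_parallel_edges(transitions):
    """Group words by (source, target) state pair.

    transitions: iterable of (source_state, word, target_state) triples.
    Returns a dict mapping (source_state, target_state) to a sorted word list,
    i.e. one "word_class" per merged edge.
    """
    word_classes = defaultdict(set)
    for source, word, target in transitions:
        word_classes[(source, target)].add(word)
    return {edge: sorted(words) for edge, words in word_classes.items()}


# Hypothetical transitions of a two-state FSA (words chosen for illustration).
transitions = [
    ("S_0", "wonderful", "S_1"),
    ("S_0", "excellent", "S_1"),
    ("S_1", "boring", "S_0"),
    ("S_0", "the", "S_0"),
]
merged = merge_parallel_edges(transitions)
for (source, target), words in merged.items():
    print(f"{source} -> {target}: word_class with {len(words)} word(s): {words}")
```

With two states there are at most four merged edges (including self-loops), regardless of vocabulary size, which is what keeps the diagram readable.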
4.2.4 Accuracy Result
For each of MGU, SRN, GRU and LSTM, we train five models with different initializations and learn five corresponding FSAs from them. Table 4 reports, for each RNN, the average accuracy of the five FSAs. For both LISORk and LISORx, the FSAs learned from MGU have the highest average accuracy among the four RNNs. LISORx also performs better than LISORk on average for MGU and GRU (the two are tied for LSTM, and LISORk is slightly better for SRN), which shows the effectiveness of kmeansx and its use of the extra position feature. To show the validity of multiple output structures, we adopt the same strategy as in the artificial tasks, i.e., combining the results of the five FSAs by majority voting. The ensemble classification results of the FSAs learned from MGU, SRN, GRU and LSTM are also shown in Table 4. For LISORk, the ensemble is better than the average of the individual FSAs for every RNN except GRU, and the FSA learned from MGU exhibits competitive performance. For LISORx, the ensemble is better than the average in all cases, and the FSA learned from MGU outperforms the FSAs of the other RNNs. With the ensemble, LISORx is better than LISORk for MGU, GRU and LSTM as well.
4.2.5 Semantic Interpretation
We now look for the semantic meaning behind the transitions between states in the FSA. We again focus on MGU, since its FSAs perform best. The results are shown in Table 5 and Table 6. We consider the transition from State 0 to State 1 in all five learned FSAs. Table 5 shows the results for the 1st and 5th FSAs; according to Figure 7, this transition leads to the accepting state. The number in brackets after each word is the index of the FSA from which the word comes. We can see that the transitions leading to the accepting state are triggered mainly by “positive” words, for example, wonderful, spectacular, sweetness, etc. We can also see that a single FSA covers only part of the positive words and thus has limited semantic meaning, while multiple FSAs together make the semantic meaning richer. The results for the 2nd and 3rd FSAs are shown in Table 6, where the same transition leads to the non-accepting state. Most of the words that activate this transition are negative, for example, dullest, unattractive, confusing, etc. Again, multiple structures make the semantic meaning richer.
Positive  riffs(1) Wonderful(1) gratitude(1) diligent(1) spectacular(1) sweetness(1) exceptional(1) Best(1) feats(1) sexy(1) bravery(1) beautifully(1) immediacy(1) meditative(1) captures(1) incredible(1) virtues(1) excellent(1) shone(1) honor(1) pleasantly(1) lovingly(1) exhilarating(1) devotion(1) teaming(1) humanity(1) graceful(1) tribute(1) peaking(1) insightful(1) frenetic(1) romping(1) proudly(1) terrific(1) Haunting(1) sophisticated(1) strives(1) exemplary(1) favorite(1) professionalism(1) enjoyable(1) alluring(1) entertaining(1) sorrowful(1) Truly(1) noble(1) bravest(1) exciting(1) Hurray(1) wonderful(1) Miracle(1)… feelings(5) honest(5) nifty(5) smashes(5) ordered(5) revisit(5) moneyed(5) flamboyance(5) reliable(5) strongest(5) loving(5) useful(5) fascinated(5) carefree(5) recommend(5) Greatest(5) legendary(5) increasing(5) loyalty(5) respectable(5) clearer(5) priority(5) Hongsheng(5) notable(5) reminiscent(5) spiriting(5) astonishing(5) charismatic(5) lived(5) engaging(5) blues(5) pleased(5) subtly(5) versatile(5) favorites(5) remarkably(5) poignant(5) Breaking(5) heroics(5) promised(5) elite(5) confident(5) underrated(5) justice(5) glowing(5) … adventure(1,5) victory(1,5) popular(1,5) adoring(1,5) perfect(1,5) mesmerizing(1,5) fascinating(1,5) extraordinary(1,5) AMAZING(1,5) timeless(1,5) delight(1,5) GREAT(1,5) nicely(1,5) awesome(1,5) fantastic(1,5) flawless(1,5) beguiling(1,5) famed(1,5) 
Negative  downbeat(1) wicked(1) jailed(1) exceptionally(1) corruption(1) eccentric(5) troubled(5) cheats(5) coaxed(5) convicted(5) steals(5) painful(5) cocky(5) endures(5) annoyingly(5) dissonance(5) disturbing(5) goofiness(1,5) 
Positive  merry(2) advance(2) excused(2) beliefs(3) romancing(3) deeper(3) resurrect(3) whitewash(3) 
Negative  shut(2) dullest(2) unattractive(2) Nothing(2) adulterous(2) stinkers(2) drunken(2) hurt(2) rigid(2) unable(2) confusing(2) risky(2) mediocre(2) nonexistent(2) idles(2) horrible(2) disobeys(2) bother(2) scoff(2) interminably(2) arrogance(2) mislead(2) filthy(2) dependent(2) MISSED(2) asleep(2) unfortunate(2) criticized(2) weary(2) corrupt(2) jeopardized(2) drivel(2) scraps(2) phony(2) prohibited(2) foolish(2) reluctant(2) Ironically(2) fell(2) escape(2) … fanciful(3) flawed(3) No(3) corrupts(3) fools(3) limited(3) missing(3) pretense(3) drugs(3) irrational(3) cheesy(3) crappy(3) cheap(3) wandering(3) forced(3) warped(3) shoplift(3) concerns(3) intentional(3) Desperately(3) dying(3) clich(3) bad(3) evil(3) evicted(3) dead(3) minor(3) drunk(3) loser(3) bothered(3) reek(3) tampered(3) inconsistencies(3) ignoring(3) Ward(3) doom(3) quit(3) goofier(3) antithesis(3) fake(3) helplessness(3) surly(3) demoted(3) fault(3) worst(3) baffling(3) destroy(3) fails(3) Pity(3) pressure(3) nuisance(3) farce(3) fail(3) worse(3) SPOLIER(3) egomaniacal(3) quandary(3) burning(3) drinker(3) blame(3) intimidated(3) perfidy(3) boring(3) conservative(3) forgetting(3) hostile(3) … unattractive(2,3) goof(2,3) lousy(2,3) stupidest(2,3) mediocrity(2,3) Badly(2,3) mediocre(2,3) waste(2,3) hypocrite(2,3) confused(2,3) vague(2,3) clumsily(2,3) stupid(2,3) 
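The bracket notation used in Tables 5 and 6, where each word is followed by the indices of the FSAs in which it triggers the transition, can be produced with a sketch like the following. The helper names are hypothetical; the example words and their FSA indices are a small subset taken from Table 5.

```python
from collections import defaultdict


def annotate_words(words_per_fsa):
    """Map each word to the sorted list of FSA indices whose transition it triggers."""
    sources = defaultdict(set)
    for fsa_index, words in words_per_fsa.items():
        for word in words:
            sources[word].add(fsa_index)
    return {word: sorted(indices) for word, indices in sources.items()}


def format_entry(word, indices):
    """Render a word in the 'word(i,j)' style used in the tables."""
    return f"{word}({','.join(str(i) for i in indices)})"


# A few words from Table 5, grouped by the FSA index shown in their brackets.
words_per_fsa = {
    1: ["wonderful", "spectacular", "adventure", "perfect"],
    5: ["honest", "legendary", "adventure", "perfect"],
}
annotated = annotate_words(words_per_fsa)
entries = [format_entry(word, annotated[word]) for word in sorted(annotated)]
print(" ".join(entries))
```

Words listed with a single index are covered by only one FSA, which is the sense in which combining several FSAs broadens the covered vocabulary.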
5 Conclusion
Since there is still no clear understanding of the inner mechanism of RNN models, it is beneficial to learn an interpretable structure from them. In this paper, motivated by the similarity between RNNs and FSAs, as well as the good interpretability of FSAs, we learn FSAs from RNNs and analyze RNNs from the FSA’s point of view. After verifying that the hidden states of gated RNNs do form clusters, we propose two methods, based on different clustering strategies, to learn FSAs from four kinds of RNNs. We illustrate the learned FSAs graphically and explicitly give the transition routes for humans to follow. We also show how the number of gates affects the performance of RNNs, and the semantic meaning behind the numerical calculation in hidden units. We find that MGU, with a minimal number of gates, can outperform the other RNNs from the FSA’s perspective. In the future, we plan to design other RNN models that share MGU’s merit of a minimal number of gates.
References
 [AS83] Dana Angluin and Carl H. Smith. Inductive inference: Theory and methods. ACM Computing Surveys, 15(3):237–269, 1983.
 [AV07] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027–1035, 2007.
 [BHS07] Gökhan Bakır, Thomas Hofmann, Bernhard Schölkopf, Alexander J Smola, Ben Taskar, and SVN Vishwanathan. Predicting structured data. MIT press, 2007.
 [BL08] Matthew B. Blaschko and Christoph H. Lampert. Learning to localize objects with structured output regression. In ECCV, pages 2–15, 2008.
 [BSF94] Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 [Col02] Michael Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In EMNLP, pages 1–8, 2002.
 [CSYU15] Liang-Chieh Chen, Alexander G. Schwing, Alan L. Yuille, and Raquel Urtasun. Learning deep structured models. In ICML, pages 1785–1794, 2015.
 [CvMG14] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014.
 [EGK04] John Ellson, Emden R. Gansner, Eleftherios Koutsofios, Stephen C. North, and Gordon Woodhull. Graphviz and dynagraph - static and dynamic graph drawing tools. In Graph Drawing Software, pages 127–148. 2004.
 [Elm90] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
 [GBC16] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. MIT Press, 2016.
 [GBK12] Abner Guzmán-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, pages 1808–1816, 2012.
 [GGCC94] Mark W. Goudreau, C. Lee Giles, Srimat T. Chakradhar, and D. Chen. First-order versus second-order single-layer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3):511–513, 1994.
 [Gil62] Arthur Gill. Introduction to the theory of finite-state machines. McGraw-Hill, New York, 1962.
 [Gol78] E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
 [Gon17] Chen Gong. Exploring commonality and individuality for multi-modal curriculum learning. In AAAI, pages 1926–1933, 2017.
 [GSC00] Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
 [GTM16] Chen Gong, Dacheng Tao, Stephen J. Maybank, Wei Liu, Guoliang Kang, and Jie Yang. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.
 [HDY12] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29:82–97, 2012.
 [HFCB08] Eyke Hüllermeier, Johannes Fürnkranz, Weiwei Cheng, and Klaus Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17):1897–1916, 2008.
 [HGS16] Sam Hare, Stuart Golodetz, Amir Saffari, Vibhav Vineet, Ming-Ming Cheng, Stephen L. Hicks, and Philip H. S. Torr. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.
 [HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [HW79] JA Hartigan and MA Wong. Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, pages 100–108, 1979.
 [KJL15] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078, 2015.
 [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
 [Lav15] Antonio Lavecchia. Machine-learning approaches in drug discovery: methods and applications. Drug Discovery Today, 20(3):318–331, 2015.
 [Lip16] Zachary Chase Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.
 [LST17] Weiwei Liu, Xiao-Bo Shen, and Ivor W. Tsang. Sparse embedded k-means clustering. In NIPS, pages 3321–3329, 2017.
 [LT15] Weiwei Liu and Ivor W. Tsang. On the optimality of classifier chain for multi-label classification. In NIPS, pages 712–720, 2015.
 [LTM17] Weiwei Liu, Ivor W. Tsang, and Klaus-Robert Müller. An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18:94:1–94:38, 2017.
 [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
 [MDP11] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In ACL, pages 142–150, 2011.
 [OG96] Christian W. Omlin and C. Lee Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.
 [PCH17] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. Context-dependent sentiment analysis in user-generated videos. In ACL, pages 873–883, 2017.
 [PD09] Hoifung Poon and Pedro M. Domingos. Unsupervised semantic parsing. In EMNLP, pages 1–10, 2009.
 [SLT17] Xiao-Bo Shen, Weiwei Liu, Ivor W. Tsang, Fumin Shen, and Quan-Sen Sun. Compressed k-means for large-scale clustering. In AAAI, pages 2527–2533, 2017.
 [SLT18] Xiaobo Shen, Weiwei Liu, Ivor W. Tsang, Quan-Sen Sun, and Yew-Soon Ong. Compact multi-label learning. In AAAI, 2018.
 [TJHA05] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
 [TQL15] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.
 [vdM09] Laurens van der Maaten. Learning a parametric embedding by preserving local structure. In AISTATS, pages 384–391, 2009.
 [VG10] Shankar Vembu and Thomas Gärtner. Label ranking algorithms: A survey. In Preference Learning, pages 45–64. 2010.
 [VTBE15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
 [WHP18] Mike Wu, Michael C. Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In AAAI, 2018.
 [WMC09] Kai Wang, Zhaoyan Ming, and Tat-Seng Chua. A syntactic tree matching approach to finding similar questions in community-based QA services. In SIGIR, pages 187–194, 2009.
 [WYJ18] Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. Optimizing whole-page presentation for web search. ACM Transactions on the Web, 12(3):19, 2018.
 [YCN15] Jason Yosinski, Jeff Clune, Anh Mai Nguyen, Thomas J. Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. CoRR, abs/1506.06579, 2015.
 [ZGS93] Zheng Zeng, Rodney M. Goodman, and Padhraic Smyth. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6):976–990, 1993.
 [Zho12] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.
 [ZWZZ16] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent neural networks. CoRR, abs/1603.09420, 2016.
 [ZZ14] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.