Weighted Finite State Transducers (WFSTs) are complex mathematical objects which find application in many fields, ranging from language and speech processing to computational biology. There exists a multitude of algorithms that operate on WFSTs ([Mohr00], [Mohr09], [MPR02]). The most prominent and most studied is the Viterbi algorithm ([Forn73], [Rabi89], [Vite67]), which stems from the field of telecommunications. Others include the weight pushing and the epsilon removal algorithms, stemming from computer science. However, such algorithms are usually computationally expensive, which is undesirable for practical applications. Moreover, besides algorithms that simply utilise the WFST, there are more intrusive algorithms that alter its parameters in an effort to optimise subsequent decoding, while maintaining its inherent structure. Some of these algorithms aim to directly reduce the number of states and arcs in the WFST, thus immediately affecting the time requirements of the decoding. On the other hand, certain algorithms try to indirectly affect the execution speed, by adapting the weights between states so that pruning algorithms examine fewer paths.
These algorithms admit modeling through tropical algebra and tropical geometry ([Kuo06], [PaSt04b], [Mohr00], [Mohr09], [MPR02]); however, no effort has been made to thoroughly explore their tropical aspects beyond expressing scalar arithmetic with operations from the tropical semiring. For detailed background on tropical algebra and the tropical semiring we refer the reader to [Cuni79], [Butk10], [GoMi08], [Mara17], [Simo94], and [Pin98]. In this paper we model the algorithms using tropical algebra and matrix operations, resulting in novel expressions in closed matrix form. We also explain aspects of the geometry of certain algorithms, namely the Viterbi pruning.
[Pin98] first introduces the min-plus arithmetic. References [Mohr00], [Mohr09], and [MPR02] are some of the most influential in the field, studying the WFST structures and proposing the corresponding algorithms. In [Vite67] the Viterbi algorithm was first introduced as an optimal decoding algorithm. Reference [Cuni79] is a thorough study of nonlinear algebras, namely the minimax algebra. In [Butk10] the author focuses on max-plus algebra from a control theory viewpoint. Max-plus algebra is also studied, along with its applications, in [BCOQ93] and [Gaub97]. In [Mara17] the author offers a comprehensive study of systems on weighted lattices as a unification of max-plus algebra and its generalisations. References [ChMa17] and [ThMa18] from our group are efforts to model perceptrons in max-plus algebra, in the case of the former, and the Viterbi algorithm and its pruning variant in min-plus algebra, in the case of the latter.
In this paper we provide a theoretical unification of WFST algorithms by modeling them using tropical algebra, which also allows for their further analysis using tools from minimax matrix theory ([Cuni79]). We first model the weight pushing algorithm, a non-intrusive algorithm that aims to speed up pruning by propagating the weights to earlier states of the WFST. Then we model the epsilon removal algorithm, which alters the structure of the WFST in order to remove unnecessary states and transitions, thus reducing its size and immediately affecting decoding. We present previous results regarding the modeling of the Viterbi algorithm and its pruning variant. Finally, we further explore the properties of certain metrics defined through the Viterbi pruning and elaborate on their motivation. Our modeling aspires to offer a connection with, and a unification via, the nonlinear vector space theory of weighted lattices ([Mara17]), and to allow for spectral analysis of these algorithms. In addition, we provide links with tropical geometry, similar to the efforts in [ChMa17] and [ThMa18].
In Section 2 we present elements of tropical algebra that will be useful in our analysis. Section 3 contains the modeling of the various algorithms in tropical algebra; namely the weight pushing, epsilon removal, and Viterbi algorithms. Finally, in Section 4 we revisit the geometry of the Viterbi pruning and we better explain the motivation for and the properties of metrics defined in previous work ([ThMa18]).
Tropical algebra is similar to linear algebra. Just as linear algebra studies systems of linear equations and their properties, tropical algebra studies systems of nonlinear equations (namely, min-plus equations) and their properties. Its main pair of operations is the pair $(\wedge, +)$, and we will use $\wedge$ to denote the minimum. The vectors and matrices of tropical algebra exist in the extended real multidimensional space defined by $\mathbb{R}_{\min} = \mathbb{R} \cup \{+\infty\}$. In this paper, we follow the notation of [Mara17] for the operations on weighted lattices. Let $A \in \mathbb{R}_{\min}^{m \times n}$ and $B \in \mathbb{R}_{\min}^{n \times p}$. Then the min-plus product between these matrices, denoted by $A \boxplus B$, is given by:
$$(A \boxplus B)_{ij} = \bigwedge_{k=1}^{n} \left( a_{ik} + b_{kj} \right).$$
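As a quick sanity check, the min-plus product can be sketched in NumPy as follows; the function name and the toy matrices are ours, for illustration only:

```python
import numpy as np

INF = np.inf  # the neutral element of min (the tropical "zero")

def minplus(A, B):
    """Min-plus matrix product: (A [+] B)_ij = min_k (a_ik + b_kj)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    # Broadcast so entry (i, j, k) holds a_ik + b_kj, then minimise over k.
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

A = np.array([[0.0, 2.0],
              [INF, 1.0]])
B = np.array([[3.0, INF],
              [0.0, 4.0]])
print(minplus(A, B))  # -> [[2., 6.], [1., 5.]]
```

Entries absent from the transducer are represented by $+\infty$, which is absorbing for $+$ and neutral for $\min$, exactly as required by the semiring.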
We will also make extensive use of two very important matrices for tropical algebra. In particular, we will use:
the $A^{*}$ matrix of a matrix $A \in \mathbb{R}_{\min}^{n \times n}$, defined as:
$$A^{*} = I \wedge A \wedge A^{2} \wedge A^{3} \wedge \dots$$
the $A^{+}$ matrix of a matrix $A$, defined as:
$$A^{+} = A \wedge A^{2} \wedge A^{3} \wedge \dots$$
We can see that $A^{*} = I \wedge A^{+}$. These two matrices are very important in tropical algebra, because they provide solutions to the eigenvector problems. In particular:
the matrix $A^{+}$ provides solutions to the min-plus eigenvector-eigenvalue problem $A \boxplus x = x$.
the matrix $A^{*}$ provides solutions to the generalised min-plus eigenvector-eigenvalue problem $(A \boxplus x) \wedge b = x$.
Tropical geometry ([Zieg12], [MaSt15]) aims to generalise the ideas of Euclidean geometry to the tropical setting. This proves useful in many cases because tropical curves are piecewise linear, which offers immediate bounds for the solution space of problems, and also offers ties to linear programming and its algorithms. Similarly to its Euclidean counterpart, a tropical line is given by Equation (4):
$$p(x, y) = (a_{1} + x) \wedge (a_{2} + y) \wedge a_{3}, \quad (4)$$
where a point lies on the line when the minimum is attained by at least two of the three terms.
Similarly to the tropical lines we can define tropical halfspaces as follows.
Let $a, b \in \mathbb{R}_{\min}^{n+1}$. An affine tropical halfspace is a subset of $\mathbb{R}_{\min}^{n}$ defined by:
$$T(a, b) = \left\{ x \in \mathbb{R}_{\min}^{n} : \bigwedge_{i=1}^{n} (a_{i} + x_{i}) \wedge a_{n+1} \leq \bigwedge_{i=1}^{n} (b_{i} + x_{i}) \wedge b_{n+1} \right\}.$$
In the text we will reference tropical polytopes. These mathematical objects arise from the combination of tropical halfspaces:
A bounded intersection of a finite number of tropical halfspaces will be called a tropical polytope.
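The "attained at least twice" condition of Equation (4) can be made concrete with a small membership test; the function name and the example coefficients are ours:

```python
import numpy as np

def on_tropical_line(a, x, y, tol=1e-9):
    """(x, y) lies on the tropical line min(a1 + x, a2 + y, a3)
    iff the minimum is attained by at least two of the three terms."""
    terms = np.array([a[0] + x, a[1] + y, a[2]])
    return int(np.sum(np.isclose(terms, terms.min(), atol=tol))) >= 2

a = (0.0, 0.0, 0.0)                   # the tropical line min(x, y, 0)
print(on_tropical_line(a, 0.0, 5.0))  # True: terms (0, 5, 0), min attained twice
print(on_tropical_line(a, 1.0, 2.0))  # False: terms (1, 2, 0), min attained once
```

The three rays where the minimum ties are precisely the piecewise-linear "legs" of the tropical line.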
3.1 Weight pushing
The weight pushing algorithm is an essential algorithm for practical application of the WFST framework. The algorithm aims to propagate the weights to earlier states of the structure, to the effect that low-probability paths are recognised earlier in the decoding sequence, and thus have a higher chance of being pruned by pruning algorithms. An irrevocable requirement is that the underlying structure of the WFST must remain the same: the algorithm might alter the weights, but the set of accepted paths and their total weights must stand unaffected. An example highlighting the weight pushing operation appears in Figure 1. An improbable path that has a low cost at an early stage will consume computational resources that could have been saved by pushing the overall weight to earlier transitions.
The algorithm can be divided into two parts: a first part, where a potential (meaning the amount that can be propagated to earlier states) is calculated, and a second part, where the actual update of the parameters occurs. A single iteration of the traditional algorithm for calculating the potential can be written in the form:
$$d^{(k+1)} = \left( W \boxplus d^{(k)} \right) \wedge \rho,$$
where $d^{(k)}$ is the potential vector for the $k$-th iteration, $W$ is the transition matrix, and $d^{(0)} = \rho$, where $\rho$ is the emission vector. By recursively substituting the values, we get that the final value of the potential vector is:
$$d = \left( I \wedge W \wedge W^{2} \wedge \dots \right) \boxplus \rho. \quad (6)$$
The calculation of Equation (6) is finite and $d = W^{*} \boxplus \rho$.
The claim is proven by the fact that we have assumed that there aren't any cycles of negative length in the WFST, and thus the shortest paths between every pair of states are finite.
Having computed the potential vectors, we define four diagonal matrices that will be useful for updating the parameters of the WFST. In particular:
The matrix $D_{\lambda}$ of the input weights, whose diagonal is the input weight vector $\lambda$.
The matrix $D_{d}$ of the potentials, whose diagonal is the potential vector $d$.
The matrix $D_{-d}$ of the negative potentials, whose diagonal is the negative of the potential vector $d$.
The matrix $D_{\rho}$ of the emission weights, whose diagonal is the emission weight vector $\rho$.
Having defined these matrices, the updated parameters of the WFST are as follows:
$$W' = D_{-d} \boxplus W \boxplus D_{d}, \qquad D_{\lambda}' = D_{\lambda} \boxplus D_{d}, \qquad D_{\rho}' = D_{-d} \boxplus D_{\rho}.$$
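The two phases can be sketched end to end on a toy three-state WFST; the matrix names and example weights are our illustration, and the diagonal-matrix update is written elementwise (the min-plus product with a diagonal matrix reduces to adding its diagonal to rows or columns):

```python
import numpy as np

INF = np.inf

def minplus(A, B):
    """Min-plus matrix product: (A [+] B)_ij = min_k (a_ik + b_kj)."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def star(A):
    """A* = I /\ A /\ A^2 /\ ... computed over n-1 min-plus powers."""
    n = A.shape[0]
    I = np.full((n, n), INF); np.fill_diagonal(I, 0.0)
    result, power = I.copy(), I.copy()
    for _ in range(n - 1):
        power = minplus(power, A)
        result = np.minimum(result, power)
    return result

# Toy WFST in the tropical (negative-log) semiring; state 2 is final.
W = np.array([[INF, 1.0, 4.0],
              [INF, INF, 2.0],
              [INF, INF, INF]])      # transition weights
rho = np.array([INF, INF, 0.0])      # emission (final) weights

# Phase 1: potentials d = W* [+] rho, i.e. the shortest cost
# from each state to a final state.
d = minplus(star(W), rho[:, None])[:, 0]

# Phase 2: push, elementwise: w'_ij = -d_i + w_ij + d_j, rho'_i = -d_i + rho_i.
W_pushed = -d[:, None] + W + d[None, :]
rho_pushed = -d + rho

print(d)  # -> [3. 2. 0.]: e.g. state 0 reaches a final state at cost 3
```

After pushing, every complete path from state $i$ keeps its total weight up to the constant $-d_i$ of its start state, which the input weights absorb ($\lambda'_i = \lambda_i + d_i$), so the accepted paths and their total weights are unchanged.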
3.2 Epsilon removal
Epsilon removal is an algorithm that aims to reduce the number of states and transitions in the WFST, while maintaining its underlying structure, in order to reduce the running time of the Viterbi algorithm. To accomplish that, an effort is made to remove epsilon transitions (meaning transitions with no input or output symbols).
The traditional algorithm for epsilon removal is illustrated in Figure 2. In essence, the algorithm computes, for each state, its epsilon closure, and then adds transitions from that state to each state reachable from the states in its epsilon closure.
To model the traditional epsilon removal algorithm in tropical algebra we need to define two matrices, in addition to the transition matrix of the WFST:
the matrix $\Sigma_{\mathrm{in}}$, which is the input symbol matrix and whose element $(i, j)$ contains the input symbol for the transition from state $i$ to state $j$.
the matrix $\Sigma_{\mathrm{out}}$, which is the output symbol matrix and whose element $(i, j)$ contains the output symbol for the transition from state $i$ to state $j$.
For completeness' sake, we need to make two remarks before we proceed to the modeling:
We only consider as epsilon transitions those where both the input and the output symbols are $\epsilon$. This is a very common assumption in the field, and usually a synchronisation algorithm has already been performed, in order to better match input and output $\epsilon$'s.
We assume that there can only be a single transition between two states, regardless of whether there exist transitions with different symbols or weights. While this might seem restrictive, in practice it is not, and it can even be circumvented.
We need to define another two matrices in order to model epsilon removal, which together make up the transition matrix $W$:
$$(W_{\epsilon})_{ij} = \begin{cases} w_{ij}, & \text{if } i \to j \text{ is an epsilon transition} \\ +\infty, & \text{otherwise,} \end{cases} \qquad (W_{\bar{\epsilon}})_{ij} = \begin{cases} w_{ij}, & \text{if } i \to j \text{ is not an epsilon transition} \\ +\infty, & \text{otherwise.} \end{cases} \quad (9)$$
Essentially, we decompose matrix $W$ using the matrices of Equation (9). We can see that $W = W_{\epsilon} \wedge W_{\bar{\epsilon}}$.
Let $W_{\epsilon}$ be the matrix defined in (9). Then, the matrix
$$I \wedge W_{\epsilon} \wedge W_{\epsilon}^{2} \wedge W_{\epsilon}^{3} \wedge \dots$$
is finite and equal to $W_{\epsilon}^{*}$, and moreover expresses the epsilon closure for all the states of the WFST.
The claim is proven by the fact that an inherent assumption in WFSTs is that there aren't any cycles of negative weight (and thus the shortest distances are finite). Since there aren't any cycles of negative weight in the original WFST, there aren't any such cycles in the WFST where we kept only the epsilon transitions. Having the epsilon closure of each state, the updated transition matrix and emission vector are simply the tropical addition (that is, the minimum) between the previous values and the values that emerge from the epsilon closure. In particular, the new transition matrix takes the form:
$$W' = W_{\bar{\epsilon}} \wedge \left( W_{\epsilon}^{*} \boxplus W_{\bar{\epsilon}} \right),$$
whereas the new emission vector takes the form:
$$\rho' = \rho \wedge \left( W_{\epsilon}^{*} \boxplus \rho \right).$$
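Under the remarks above (single transitions, $\epsilon$ on both tapes), the removal can be sketched as follows; the symbol matrices, the sentinel values, and the toy weights are our illustration:

```python
import numpy as np

INF = np.inf
EPS, NONE = 0, -1  # hypothetical ids: 0 marks epsilon, -1 marks "no transition"

def minplus(A, B):
    """Min-plus matrix product: (A [+] B)_ij = min_k (a_ik + b_kj)."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def star(A):
    """A* = I /\ A /\ A^2 /\ ... computed over n-1 min-plus powers."""
    n = A.shape[0]
    I = np.full((n, n), INF); np.fill_diagonal(I, 0.0)
    result, power = I.copy(), I.copy()
    for _ in range(n - 1):
        power = minplus(power, A)
        result = np.minimum(result, power)
    return result

W = np.array([[INF, 1.0, INF],
              [INF, INF, 2.0],
              [INF, INF, INF]])
in_sym = np.array([[NONE, EPS, NONE],
                   [NONE, NONE, 5],
                   [NONE, NONE, NONE]])
out_sym = in_sym.copy()               # epsilon on both tapes for the 0->1 arc
rho = np.array([INF, INF, 0.0])

is_eps = (in_sym == EPS) & (out_sym == EPS)
W_eps = np.where(is_eps, W, INF)      # epsilon part of the decomposition
W_bar = np.where(is_eps, INF, W)      # non-epsilon part

closure = star(W_eps)                                # epsilon-closure distances
W_new = np.minimum(W_bar, minplus(closure, W_bar))   # W' as in the text
rho_new = np.minimum(rho, minplus(closure, rho[:, None])[:, 0])

print(W_new[0, 2])  # -> 3.0: the eps arc 0->1 folded into a direct 0->2 arc
```

After the update, state 1's epsilon in-arc is no longer needed to reach state 2 from state 0; a trimming pass can then drop states left inaccessible through non-epsilon transitions.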
3.3 Viterbi algorithm
The Viterbi algorithm aims to decode a sequence of input symbols, meaning that it tries to map the sequence of symbols to the sequence of states that has the highest probability. Formally, it is known that the Viterbi algorithm can be written in the following max-product form:
$$\delta_{t+1}(j) = \max_{i} \, \delta_{t}(i) \, a_{ij} \, b_{j}(o_{t+1}),$$
where $a_{ij}$ is the probability of transitioning from state $i$ to state $j$, $b_{j}(o_{t})$ denotes the observation probability of the symbol $o_{t}$ at state $j$, and, finally, $\delta_{t}(i)$ is the maximum probability for the current state, calculated along the path from the previous states. In [ThMa18] we postulated that the Viterbi algorithm can be written in a closed matrix form in tropical algebra as:
$$x_{t+1} = B_{t+1} \boxplus W \boxplus x_{t}, \quad (14)$$
where $x_{t} = -\log \delta_{t}$ elementwise, $w_{ij} = -\log a_{ji}$, and $B_{t}$ is a diagonal matrix whose diagonal is the vector $b_{t}$, with $b_{t}(j) = -\log b_{j}(o_{t})$.
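In the negative-log domain the recursion of Equation (14) is one min-plus product per time step, with the diagonal matrix acting as an elementwise addition. A minimal sketch with made-up probabilities:

```python
import numpy as np

# Made-up 2-state model: a[i, j] = P(i -> j), b[j, o] = P(obs o | state j).
a = np.array([[0.7, 0.3],
              [0.4, 0.6]])
b = np.array([[0.9, 0.1],
              [0.2, 0.8]])
W = -np.log(a)               # transition costs w_ij = -log a_ij
B = -np.log(b)               # observation costs B[j, o] = -log b_j(o)

obs = [0, 1, 0]
x = B[:, obs[0]]             # initial costs (uniform initial distribution omitted)
for o in obs[1:]:
    # x_{t+1}(j) = b_j(o) + min_i (w_ij + x_t(i)): a min-plus step plus a diagonal.
    x = B[:, o] + np.min(W + x[:, None], axis=0)

print(np.argmin(x))          # -> 0: the most probable final state
print(np.exp(-x.min()))      # probability of the best path
```

Working with costs instead of probabilities avoids numerical underflow on long sequences, which is the practical reason the tropical form is preferred.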
3.4 Viterbi pruning
The Viterbi pruning is a variant of the Viterbi algorithm that aims to sacrifice the optimality of decoding in an effort to significantly speed up the decoding process. Usually, pruning is based on one of the following criteria, or even their combination:
users determine a leniency parameter $\theta$, and at each step only the paths whose cost is at most $\theta$ away from the optimal path survive.
users determine a beam width $k$, and at each step only the $k$-best paths survive the pruning.
In [ThMa18] we modeled the Viterbi pruning in tropical algebra using Cuninghame-Green's inverse ([Cuni79]). Therein it is proven that the negative elements of
$$p_{t} = X_{t}^{\sharp} \boxplus' \left( c + \mathbf{0} \right) \quad (16)$$
indicate the indices that need to be pruned. The matrix $X_{t}$ is a diagonal matrix whose diagonal is the state vector $x_{t}$, and $X_{t}^{\sharp} = -X_{t}^{\mathrm{T}}$ is its Cuninghame-Green inverse. Also, $c = \theta + \left( \mathbf{0}^{\mathrm{T}} \boxplus x_{t} \right)$, where $\theta$ is the leniency parameter and $\mathbf{0}$ is a vector that comprises solely of 0s; that is, $c$ equals the optimal (minimum) value of $x_{t}$ plus the leniency. Finally, $\boxplus'$ denotes the max-plus matrix multiplication.
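Elementwise, the pruning vector reduces to comparing each state cost against the best current cost plus the leniency; a sketch with made-up costs:

```python
import numpy as np

theta = 2.0                           # leniency parameter
x_t = np.array([1.0, 2.5, 4.0])       # current state costs (negative log-probs)

c = theta + x_t.min()                 # best cost plus leniency
p_t = c - x_t                         # pruning vector: diag(-x_t) acting on c
survivors = np.flatnonzero(p_t >= 0)  # negative entries are pruned

print(survivors)  # -> [0 1]: state 2 lies more than theta above the optimum
```

The survivors are exactly the paths admitted by the first pruning criterion above; a beam-width criterion would instead keep the $k$ smallest entries of $x_t$.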
Moreover, if we consider a variable vector $z$ and bound it using:
the Viterbi update law of Equation (14), and thus:
$$z \geq x_{t},$$
the pruning vector of Equation (16), and thus:
$$z \leq c + \mathbf{0}, \quad (18)$$
then we can define:
a normalised volume metric $\mathcal{V}_{t}$:
$$\mathcal{V}_{t} = \frac{1}{n} \sum_{i=1}^{n} \nu_{t}(i), \quad (19)$$
a normalised entropy metric $\mathcal{H}_{t}$:
$$\mathcal{H}_{t} = -\frac{1}{\log n} \sum_{i=1}^{n} \bar{\nu}_{t}(i) \log \bar{\nu}_{t}(i), \quad (20)$$
where $\nu_{t}(i) = \max\left( c - x_{t}(i), 0 \right) / \theta$ and $\bar{\nu}_{t} = \nu_{t} / \sum_{i} \nu_{t}(i)$. Essentially, $\nu_{t}$ is the degree to which each dimension satisfies the Viterbi constraints.
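With the slack vector in hand, both metrics are short computations; the costs below are made up, and the exact normalisations follow our reading of Equations (19) and (20):

```python
import numpy as np

theta = 2.0
x_t = np.array([1.0, 2.5, 4.0])       # made-up state costs
c = theta + x_t.min()                 # pruning threshold

# Slack per state, clipped and normalised to [0, 1]: the degree to which
# each dimension satisfies the Viterbi constraints.
nu = np.clip((c - x_t) / theta, 0.0, 1.0)

volume = nu.mean()                                   # normalised volume (19)
q = nu[nu > 0] / nu[nu > 0].sum()                    # distribution over survivors
entropy = -(q * np.log(q)).sum() / np.log(len(x_t))  # normalised entropy (20)

print(round(volume, 4))  # -> 0.4167
```

A shrinking volume signals that the current leniency admits few paths, while a high entropy signals that the surviving paths have unexpectedly dissimilar slacks, both of which can be monitored per iteration.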
4 Discussion of Geometry
We devote this section to the further analysis of the metrics of Equations (19) and (20), and also the motivation behind their definition. At every iteration of the Viterbi pruning algorithm, consider the state vector along with the leniency vector of Equation (18). In unison, these vectors define a tropical polytope for each iteration of the algorithm. The indices of the state vector that satisfy the constraints imposed by the leniency vector act as the sides of this polytope, and the differences between the value of the leniency vector and the state vector at those indices constitute the vector $\nu_{t}$ of Equation (19). Figure 3 visualises the polytope of each iteration, and also highlights the vector $\nu_{t}$. Discussing the metrics further:
Consider the normalised volume of (19). The metric can offer a quantitative estimate of the solution space that the Viterbi pruning admits. Indeed, since the values of $\nu_{t}$ in Equation (19) are normalised, this metric can provide a measure of how many paths the current choice of the leniency parameter allows to survive. Exploiting that remark, it is possible to monitor how this metric evolves throughout the iterations and, when needed, adapt the value of the leniency parameter in order to maintain a desired level of normalised volume.
Consider the normalised entropy of (20). The metric can offer a qualitative estimate of the solution space that the Viterbi pruning admits. In information theory, entropy expresses the degree of surprise incurred by the observation of a sample. In essence, if the sample abides by the existing modeling of the assumed distribution, then it will have low entropy, as its value is in an expected range. However, if the sample has a significantly different value from those expected under the assumptions for the distribution, then the sample will have very high entropy, indicating that there may be an error in the original modeling of the distribution.
Thus, by utilising the above metrics we aim to reason about the solution space of the Viterbi pruning in two ways: a quantitative analysis of the relative size of the solution space, and a qualitative analysis of the likelihood of the paths in the solution space. Having such measures, we can examine how the solution space, and the quantity and quality of its solutions, evolves over the execution of the Viterbi algorithm. Even more, we can introduce them into the design of the algorithm, so that the leniency parameter is adapted to the needs of each iteration.
In this work we modeled algorithms that operate on WFSTs using tropical algebra and matrix operations on weighted lattices, unifying them under a common framework. First, we modeled the weight pushing algorithm by expressing the potential calculation through an instrumental matrix of tropical algebra. We then proceeded to model the epsilon removal algorithm by exploiting the min-superposition of tropical algebra and expressing the epsilon closure as another important matrix in tropical algebra. Finally, we analysed some geometrical aspects of the Viterbi pruning, elaborating on metrics that were defined in previous work. In future work we aim to explore the connection of these structures with the nonlinear vector space theory of weighted lattices and nonlinear spectral theory.