Hybrid system identification aims at estimating a model of a system switching between different operating modes from input-output data. More precisely, most of the literature considers autoregressive with external input (ARX) models to cast the problem as a regression one . Then, two cases can be distinguished: switching regression, where the system arbitrarily switches from one mode to another, and piecewise affine (PWA) regression, where the switches depend on the regressors. A number of methods with satisfactory performance in practice are now available for these problems . However, compared with linear system identification, a major weakness of these methods is their lack of guarantees.
For the particular case of noiseless data, the algebraic method  provides a solution to switching regression with a small number of modes. However, the quality of the estimates quickly degrades with the increase of the noise level. A few sparsity-based methods [4, 5] also offer guarantees in the noiseless case, but these are subject to a condition on both the data and the sought solution. In the presence of noise, most methods consider the minimization of the error of the model over the data . While this does not necessarily yields the best predictive model (due to issues like identifiability, persistence of excitation and access to a limited amount of data), obtaining statistical guarantees with such an approach has a long history in statistics and system identification 
. However, such results are not available for hybrid systems. This is probably due to the fact that minimizing the error of a hybrid model is a difficult nonconvex optimization problem involving the simultaneous classification of the data points into modes and the regression of a submodel for each mode. Thus, theoretical guarantees could only be obtained under the rather strong assumption that this problem has been solved to global optimality and most of the literature[7, 8, 9, 10, 11, 12]
focuses on this issue with heuristics of various degrees of accuracy and computational efficiency. Many recent works[4, 13, 14, 15, 16, 5] try to avoid local minima by considering convex formulations, but these only yield optimality with respect to a relaxation of the original problem. Global optimality in the presence of noise was only reached in 
for a particular class of continuous PWA maps known as hinging-hyperplanes by reformulating the problem as a mixed-integer program solved by branch-and-bound techniques. However, such optimization problems are NP-hard and branch-and-bound algorithms have a worst-case complexity exponential in the number of integer variables, here proportional to the number of data and the number of modes.
Inspired by related clustering problems, such as the minimization of the sum of squared distances between points and their group centers, we could minimize the hybrid model error by enumerating all possible classifications of the points. But the number of classifications is exponential in the number of data. Conversely, the other approach enumerating a sample of values for the real variables of the problem is exponential in the dimension and can only offer an approximate solution.
Overall, the literature does not provide a method that can guarantee both the optimality and the computability of a global minimizer of the error, while the computational complexity of this problem remains unknown and cannot be deduced from the NP-hardness of classical clustering problems  (see [18, 20] for an introduction to computational complexity and its relevance to control theory).
The paper provides two results regarding the computational complexity of PWA regression, and more precisely for the problem of finding a global minimizer of the error of a PWA model, formalized in Sect. 2. First, we show in Sect. 3 that the problem is NP-hard. Then, we show in Sect. 4 that, for any fixed dimension of the data, an exact solution can be computed in time polynomial in the number of data via an enumeration of all possible classifications. To obtain this result and avoid the exponential growth of the number of classifications with the number of data, we show that, in PWA regression, the classification of the data points is highly constrained and the number of classifications to test can be limited. The price to pay for this gain is an exponential complexity with respect to the data dimension and the number of modes. Future work is outlined in Sect. 5.
We use the indicator function of an event that is if the event occurs and otherwise. We define if and otherwise. Given a set of labels and a set of points, a labeling of these points is any . We use as a shorthand for . Given two sets, and , is the set of functions from to .
2 Problem formulation
As in most works, we concentrate on discrete-time PWARX system identification considered as a PWA regression problem with regression vectorsbuilt from past inputs and outputs . Since we are interested in computational complexity results, we restrain the data to rational, digitally representable, values and set . The outputs are assumed to be generated by a PWA system as , where is a noise term. More precisely, PWA models can be expressed via a set of affine submodels and a function determining the active submodel: , where .
We call the function
a classifier as it classifies the data points in the different modes. Typically, PWA systems are defined withimplementing a polyhedral partition of , with modes possibly spanning unions of polyhedra. However, in most of the literature on PWA system identification [1, 7, 8, 9, 16], is estimated within the family of linear classifiers
based on a set of linear functions and for which a mode spanning a union of polyhedra must be modeled as several modes with similar affine submodels. For PWA maps with modes, is a binary classifier for which it is common to consider its output in instead of . Such a binary classifier can be obtained by taking the sign of a real-valued function. If this function is linear (or affine), then we obtain a linear classifier, which is equivalent to a separating hyperplane dividing the input space in two half-spaces. In this case, the function class can be defined as
with a single set of parameters corresponding to the normal to the hyperplane and the offset from the origin. An equivalence with the multi-class formulation in (1) is obtained by using and .
In this paper, we consider the common estimation approach of minimizing the error on data pairs
, measured pointwise by a loss functionas
More precisely, we focus on well-posed instances of the problem where is significantly larger than the dimension and the number of modes is given. Indeed, with free the problem is ill-posed as the solution is only defined up to a trade-off between the number of modes and the model accuracy. For the converse well-posed approach that minimizes for a given error bound, a complexity analysis can be found in . Under these assumptions, the problem is as follows.
Problem 1 (Error-minimizing PWA regression).
The following analyzes the time complexity of Problem 1
under the classical model of computation known as a Turing machine. The time complexity of a problem is the lowest time complexity of an algorithm solving any instance of that problem, where the time complexity of an algorithm is the maximal number of steps occuring in the computation of the corresponding Turing machine program. The loss function is assumed to be computable in polynomial time throughout the paper.
This section contains the proof of the following NP-hardness result, where an NP-hard problem is one that is at least as hard as any problem from the class NP of nondeterministic polynomial time decision problems (NP is the class of all decision problems for which a solution can be certified in polynomial time).
With a loss function such that , Problem 1 is NP-hard.
The proof uses a reduction from the partition problem, known to be NP-complete , i.e., a problem that is both NP-hard and in NP.
Problem 2 (Partition).
Given a multiset (a set with possibly multiple instances of its elements) of positive integers, , decide whether there is a multisubset such that
Problem 3 (Decision form of PWA regression).
Problem 3 is NP-complete.
Since given a candidate pair the condition (4) can be verified in polynomial time, Problem 3 is in NP. Then, the proof of its NP-completeness proceeds by showing that the Partition Problem 2 has an affirmative answer if and only if Problem 3 with has an affirmative answer.
Given an instance of Problem 2, let , , and build a data set with
where , is the set of indexes of the elements of in and . This gives
and we can similarly show that
It remains to prove that if (4) holds with , then Problem 2 has an affirmative answer. To see this, note that due to being positive, a zero cost implies a zero loss for all data points. Thus, by , if (4) holds with ,
Also note that if for some , we have . This is only possible if , which is not the case (otherwise we can simply remove without influencing the partition problem), or if . The latter is impossible if since is a linear classifier that must return the same category for all points on the line segment between and , which includes the origin and thus would imply by (5) that . As a consequence, cannot hold, and since we can similarly show that cannot hold, we have for all . Hence, (5) leads to
Let and . Then, if , and for all , . Therefore, for all , , while for all , (6) gives , i.e., and . This leads to
Thus, if or , a valid partition in the sense of Problem 2 is obtained with . In addition, if and , then by (5), , which by construction implies that . In this case, we redefine and to obtain for all and for all , resulting also in a valid partition by the fact that . Since a similar reasoning applies to the case by symmetry (substituting for ), a zero cost, i.e., (4) with , always implies an affirmative answer to Problem 2. ∎
4 Polynomial complexity in the number of data
We now state the result regarding the polynomial complexity of Problem 1 with respect to under the following assumptions, the first of which holds almost surely for randomly drawn data points, while the second one holds for instance for with a linear time complexity .
The points are in general position, i.e., no hyperplane of contains more than points.
Given , the problem has a polynomial time complexity for any fixed integer .
The proof of Theorem 2 relies on the existence of exact algorithms with complexity polynomial in for the binary case (, Proposition 4) and the multi-class case (, Corollary 1). These algorithms are based on a reduction of Problem 1 to a combinatorial search in two steps. The first step reduces the problem to a classification one. Indeed, Problem 1 can be reformulated as the search for the classifier , since by fixing , the optimal parameter vectors can be obtained by solving
independent linear regression problems on the subsets of data resulting from the classification by, which, by Assumption 2, can be performed in the polynomial time . This yields the following reformulation of the problem.
Problem 1 is equivalent to
4.1 Finding the optimal classification
We reduce the complexity of searching for the classifier by considering all possible linear classifications instead of all possible linear classifiers. In other words, we project the class of classifiers onto the set of points
to reduce a continuous search to a combinatorial problem. This is in line with the techniques used in statistical learning theory for the different purpose of computing error bounds for infinite function classes. Thus, we introduce definitions from this field.
Definition 1 (Projection onto a set).
The projection of a set of classifiers onto , denoted , is the set of all labelings of that can be produced by a classifier in :
Definition 2 (Growth function).
The growth function of at is the maximal number of labelings of points that can be produced by classifiers from :
We now focus on binary PWA maps and thus on binary classifiers with output in . For such classifiers, we obviously have for all . By further restricting to affine classifiers as in (2), results from statistical learning theory (see, e.g., ) provide the tighter bound , which is polynomial in and thus promising from the viewpoint of global optimization. However, its proof is not constructive and does not provide an explicit algorithm for enumerating all the labelings. The following theorem, though leading to a looser bound on the growth function, offers a constructive scheme to compute the projection , which is what we need in order to test all the labelings in for global optimization.
The growth function of the class of binary affine classifiers of , in (2), is bounded for any by
and, for any set of points in general position, an algorithm builds the projection in time.
For any binary affine classifier in (2) and any finite set of points in general position, there is a subset of points of cardinality and a separating hyperplane of parameters passing through the points in , i.e.,
which yields the same classification of in the sense that
Proof sketch. For all classifiers with separating hyperplanes passing through points of , the statement is obvious. For the others passing through points with , they can be transformed to pass through additional points without changing the classification of the remaining points. If , it suffices to translate the hyperplane to the closest point. If , the hyperplane can be rotated with a plane of rotation that leaves unchanged the subspace spanned by the points and a minimal angle yielding a rotated hyperplane passing through points, where by the general position assumption. Iterating this scheme until yields a hyperplane passing through the points in of parameters satisfying (9) and (10). ∎
We can now prove Theorem 3.
Proof of Theorem 3.
For any labeling in , there is a classifier that produces this labeling. Applying Proposition 3 to , we obtain another classifier of parameters that passes through the points in and that agrees with on . Let be defined by , . Then, we generate labelings by setting its entries with to all the combinations of signs (recall that ). By construction, there is no labeling of that agrees with on other than these labelings. Since this holds for any , the cardinality of cannot be larger than times the number of hyperplanes passing through points of . Since each subset of cardinality gives rise to two hyperplanes of opposite orientations, the number of such hyperplanes is and we have . In addition, there is an algorithm that enumerates all the subsets in iterations and builds by computing a hyperplane passing through the points111The normal of a hyperplane passing through points in can be computed as a unit vector in the null space of , while the offset is given by for any of the ’s. in and the corresponding labelings at each iteration. Since these inner computations can be performed in constant time with respect to , the algorithm has a time complexity in the order of . ∎
4.2 Global optimization of binary PWA models
We can use the results above to reduce the complexity of Problem 1 in the binary case, considered in the following in its equivalent form (8) from Proposition 2. First, note that the cost function in (8) only depends on , since all feasible values of for a given yield the same cost. Furthermore, the cost does not depend on the exact value of , but only on the resulting classification, i.e., on , . Thus, given a global solution to (8), any classifier producing the same classification yields the same cost function value and hence is also a global solution. Thus, the problem reduces to the search for the correct classification , whose complexity is in and bounded by Theorem 3. In addition, for the purpose of binary PWA regression, opposite labelings and are equivalent and can be pruned from . This is due to the symmetry of the cost function (8). Algorithm 1 provides a solution to Problem 1 for the binary case while taking this symmetry into account.
By following a similar path as for Theorem 3, Algorithm 1 can be proved to test all linear classifications of the data points up to symmetric ones. Since Algorithm 1 computes a solution in terms of that is feasible for (8) for each of these classifications, the value of coincides with the cost function of (8) for a particular . By the symmetry of this cost function with respect to and the fact that it only depends on via its values at the data points, Algorithm 1 computes all possible values of the cost function, including the exact global optimum of (8), and returns a global minimizer. Thus, by Proposition 2, it also solves Problem 1. The total number of iterations of Algorithm 1 is and, under Assumption 2, these iterations only involve operations computed in polynomial time in the order of , hence the overall time complexity in the order of . ∎
4.3 Multi-class extension
For , the boundary between 2 modes and implemented by a linear classifier from in (1) is a hyperplane of equation , i.e., based on the difference of the two functions and . Based on these hyperplanes, the classification rule can be written as
Based on these facts, we can build an algorithm to recover all possible classifications consistent with a linear classification in the sense of (1).
For the set of multi-class linear classifiers of , in (1), the growth function is bounded for any by
and, for any set of points of in general position, an algorithm builds in time.
Any classification produced by a classifier from (1) can be computed from the signs of the functions , , corresponding to the pairwise separating hyperplanes. For any , for each of these hyperplanes, Proposition 3 provides an equivalent binary classifier which must be one from the hyperplanes passing through points of . The number of sets of such hyperplanes is . Since these classifiers cannot produce all the classifications of the points in the sets , we must also take these into account so that the number of classifications of is upper bounded by . This upper bound holds for any , and thus also applies to the growth function. Finally, an algorithm that makes explicit all the classifications mentioned above to build can be constructed in a recursive manner, with one classification per iteration and thus with a similar number of iterations, each one including computations performed in constant time. ∎
Theorem 4 implies the following for PWA regression.
The paper discussed complexity issues for PWA regression and showed that i) the global minimization of the error is NP-hard in general, and ii) for fixed number of modes and data dimension, an exact solution can be obtained in time polynomial in the number of data. The proof of NP-hardness also implies that the problem remains NP-hard even when the number of modes is fixed to , which indicates that the complexity is mostly due to the data dimension. An open issue concerns the conditions under which a PWA system generates trajectories satisfying the general position assumption used by the polynomial-time algorithm. Future work will also focus on the extension of the results to the case of arbitrarily switched systems and heuristics inspired by the polynomial-time algorithm, whose practical application remains limited by an exponential complexity in the dimension.
The author would like to thank the anonymous reviewers for their comments and suggestions. Thanks are also due to Yann Guermeur for carefully reading this manuscript.
-  S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, R. Vidal, Identification of hybrid systems: a tutorial, European Journal of Control 13 (2-3) (2007) 242–262.
-  A. Garulli, S. Paoletti, A. Vicino, A survey on switched and piecewise affine system identification, in: Proc. of the 16th IFAC Symp. on System Identification (SYSID), 2012, pp. 344–355.
-  R. Vidal, S. Soatto, Y. Ma, S. Sastry, An algebraic geometric approach to the identification of a class of linear hybrid systems, in: Proc. of the 42nd IEEE Conf. on Decision and Control (CDC), Maui, Hawaï, USA, 2003, pp. 167–172.
-  L. Bako, Identification of switched linear systems via sparse optimization, Automatica 47 (4) (2011) 668–677.
-  I. Maruta, H. Ohlsson, Compression based identification of PWA systems, in: Proc. of the 19th IFAC World Congress, Cape Town, South Africa, 2014, pp. 4985–4992.
-  L. Ljung, System identification: Theory for the User, 2nd Edition, Prentice Hall, 1999.
-  G. Ferrari-Trecate, M. Muselli, D. Liberati, M. Morari, A clustering technique for the identification of piecewise affine systems, Automatica 39 (2) (2003) 205–217.
-  A. Bemporad, A. Garulli, S. Paoletti, A. Vicino, A bounded-error approach to piecewise affine system identification, IEEE Transactions on Automatic Control 50 (10) (2005) 1567–1580.
-  A. L. Juloski, S. Weiland, W. Heemels, A Bayesian approach to identification of hybrid systems, IEEE Transactions on Automatic Control 50 (10) (2005) 1520–1533.
-  F. Lauer, G. Bloch, R. Vidal, A continuous optimization framework for hybrid system identification, Automatica 47 (3) (2011) 608–613.
-  F. Lauer, Estimating the probability of success of a simple algorithm for switched linear regression, Nonlinear Analysis: Hybrid Systems 8 (2013) 31–47, supplementary material available at http://www.loria.fr/~lauer/klinreg/.
-  T. Pham Dinh, H. Le Thi, H. Le, F. Lauer, A difference of convex functions algorithm for switched linear regression, IEEE Transactions on Automatic Control 59 (8) (2014) 2277–2282.
-  N. Ozay, M. Sznaier, C. Lagoa, O. Camps, A sparsification approach to set membership identification of switched affine systems, IEEE Transactions on Automatic Control 57 (3) (2012) 634–648.
-  H. Ohlsson, L. Ljung, S. Boyd, Segmentation of ARX-models using sum-of-norms regularization, Automatica 46 (6) (2010) 1107–1111.
-  H. Ohlsson, L. Ljung, Identification of switched linear regression models using sum-of-norms regularization, Automatica 49 (4) (2013) 1045–1050.
F. Lauer, V. L. Le, G. Bloch, Learning smooth models of nonsmooth functions via convex optimization, in: Proc. of the IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP), Santander, Spain, 2012.
-  J. Roll, A. Bemporad, L. Ljung, Identification of piecewise affine systems via mixed-integer programming, Automatica 40 (1) (2004) 37–50.
-  M. Garey, D. Johnson, Computers and Intractability: a Guide to the Theory of NP-Completeness, W.H. Freeman and Company, 1979.
-  D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Machine Learning 75 (2) (2009) 245–248.
-  V. Blondel, J. Tsitsiklis, A survey of computational complexity results in systems and control, Automatica 36 (9) (2000) 1249–1274.
-  R. Alur, N. Singhania, Precise piecewise affine models from input-output data, in: Proc. of the 14th Int. Conf. on Embedded Software (EMSOFT), 2014.
-  G. H. Golub, C. F. Van Loan, Matrix Computations, 4th Edition, John Hopkins University Press, 2013.
-  V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.