Dissecting Deep Neural Networks

10/09/2019 ∙ by Haakon Robinson, et al. ∙ Oklahoma State University ∙ NTNU

In exchange for large quantities of data and processing power, deep neural networks have yielded models that provide state-of-the-art prediction capabilities in many fields. However, a lack of strong guarantees on their behaviour has raised concerns over their use in safety-critical applications. A first step to understanding these networks is to develop alternate representations that allow for further analysis. It has been shown that neural networks with piecewise affine activation functions are themselves piecewise affine, with their domains consisting of a vast number of linear regions. So far, the research on this topic has focused on counting the number of linear regions, rather than obtaining explicit piecewise affine representations. This work presents a novel algorithm that can compute the piecewise affine form of any fully connected neural network with rectified linear unit activations.


I Introduction

The recent successes of Machine Learning (ML) in image classification ([10], [29], [13]) and games like Go can be largely attributed to Deep Neural Networks (DNN) [9]. In particular, the ability to train extremely deep neural networks has yielded unprecedented performance in a myriad of fields. Examples of applications include diabetes detection [12], action detection for surveillance [31], feature learning for process pattern recognition [34], denoising for speech enhancement [14], fault diagnosis [25], social image understanding [15], and low light image enhancement [16].

However, DNNs are still regarded as "black box" models, and few guarantees can be made about their behaviour. Adversarial attacks have exposed that many existing DNN models have very low robustness: it has been shown that by changing the input minimally in a targeted way, DNNs can be tricked into giving a completely wrong output. Such attacks are sometimes limited to a single pixel [28]. This is especially problematic when applying DNNs in safety-critical applications like robotic surgery or autonomous cars. Therefore, the focus has now shifted to developing new methods for analyzing DNNs.

It has been shown that neural networks that only use piecewise affine (PWA) activation functions can themselves be expressed as a PWA function defined on convex polyhedra, although the number of regions can be enormous [7, 20]. A succinct description of the structure and combinatorics of PWA neural networks can be found in chapter 7 of [27]. A function f is PWA if it can be defined piecewise over a set of polyhedral regions R_i:

f(x) = A_i x + a_i    for x ∈ R_i,    i = 1, …, N    (1)

Methods for counting the number of regions have been developed, but little research has been done into explicitly finding and working with the PWA representation of neural networks, likely due to the high complexity of such a function. This is unfortunate, as there is a wealth of literature on PWA functions, particularly in the context of modeling and control. For example, it is well known that the explicit solution to the linear Model Predictive Control (MPC) problem is a PWA function, and schemes for using PWA models in the optimisation loop exist [6]. Furthermore, there exist methods for verifying the stability of PWA systems, as well as stabilising them [18]. Positive invariant sets can be constructed for PWA systems by analysing the possible transitions between the linear regions of the system [5]. Thus, by decomposing a DNN into its PWA representation, these established methods can be used to obtain concrete stability results for a large and useful family of neural networks.

Studies of the linear regions of neural networks started with the need to understand how expressive these networks are (a more expressive network has the ability to compute more complex, rich functions), and how this changes with the architecture (number of layers, width of layers, etc.) [19, 22]. Expressivity is often measured using the Vapnik–Chervonenkis (VC) dimension [32], and tight bounds have been found for the VC dimension of PWA neural networks [3]. Empirical measures for the expressivity of PWA networks have also been developed [23]. Empirical evidence strongly suggests that increasing the depth of a network has a bigger impact on expressivity than increasing the width of existing layers [7, 30]. In [24] the authors present upper and lower bounds on the maximum number of regions that improve on previous results, along with a mixed-integer formulation from which the regions can be counted by enumerating the integer solutions. They established that for a network with input dimension n_0 and L hidden layers, each with n nodes and ReLU activation, the asymptotic bounds for the maximal number of regions are:

(2)

This upper bound is exponential in both n_0 and L. The most challenging aspect in terms of analysis is that the most useful neural networks are those with both a large input dimension n_0 and a large number of hidden layers L. The number of linear regions of such a network is enormous. It is likely due to this that there have been a limited number of studies into the identification of these regions. There have been studies on approximating nonlinear neural networks with PWA functions [1]. Conversely, work has been done on the inverse problem of representing PWA functions more compactly as neural networks [33].

To this end, the main contribution of this work is an algorithm that can convert any neural network using fully connected layers and ReLU activations into its exact PWA representation, which can be visualised and analysed, giving an insight into the inner workings of the network. This was achieved by utilising existing linear programming (LP) methods (specifically the MPT toolbox for MATLAB© [11]) for working with polyhedral sets and hyperplane arrangements. The approach can also be extended to any linear / affine layer (convolutional layers, batch normalisation), as well as any PWA activation function (Leaky ReLU, maxout).

II PWA Functions and Neural Networks

Fig. 1: Neural network with fully connected layers and activation function σ. The l-th fully connected layer takes in the output z^(l-1) of the previous layer and produces ẑ^(l), where ẑ^(l) = W^(l) z^(l-1) + b^(l). This value is then passed through the nonlinear activation function σ, which operates element-wise on ẑ^(l), yielding z^(l) = σ(ẑ^(l)).

Neural networks consisting of linear / affine layers and piecewise affine (PWA), continuous activation functions are themselves PWA and continuous [7] (see Equation (1)). Note that any PWA function can also be written as a piecewise linear (PWL) function:

f(x) = A_i x + a_i = [A_i  a_i] [x; 1]    for x ∈ R_i    (3)

This can also be viewed as expressing x in homogeneous coordinates. This allows chains of affine transformations to be written more compactly as a series of matrix multiplications.
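To make this concrete, here is a minimal NumPy sketch (illustrative only; the implementation in this work uses MATLAB and the MPT toolbox, and the matrices below are arbitrary example values) of an affine map written in homogeneous coordinates, and of a chain of two affine maps collapsing into a single matrix product.

import numpy as np

def homogeneous(A, a):
    # Build [A a; 0 1] so that the affine map A x + a becomes a matrix product.
    m, n = A.shape
    return np.vstack([np.hstack([A, a[:, None]]),
                      np.hstack([np.zeros((1, n)), np.ones((1, 1))])])

A1, a1 = np.array([[2.0, 0.0], [0.0, 3.0]]), np.array([1.0, -1.0])
A2, a2 = np.array([[1.0, 1.0]]), np.array([0.5])
T1, T2 = homogeneous(A1, a1), homogeneous(A2, a2)

x = np.array([0.5, 2.0])
x_h = np.append(x, 1.0)                  # x in homogeneous coordinates

# Chaining two affine maps reduces to a single matrix multiplication.
assert np.allclose((T2 @ T1) @ x_h, np.append(A2 @ (A1 @ x + a1) + a2, 1.0))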

A general neural network can be viewed as a graph of nodes (neurons) with weighted, directed edges, where some of the nodes are taken to be inputs, and some are outputs. Each neuron i is associated with a scalar activation function σ_i and a scalar bias term b_i. The value of any given neuron is σ_i(s_i), where s_i is the sum of its weighted inputs plus the bias term. The value of the neuron is passed on to other neurons through the outgoing edges. The values of the input neurons are set directly. In summary, for some neuron i, its value z_i is:

z_i = σ_i( Σ_{j ∈ N(i)} w_ij z_j + b_i )    (4)

where N(i) is the set of neurons that are connected to neuron i, and w_ij is the weight of the edge from j to i. This general formulation presupposes no structure, and may include cycles. A more common architecture is to assume that the neurons are organised into layers that are connected adjacently, but not internally, as can be seen in Fig. 1. This is known as a feedforward neural network. The advantage of this formulation is that the neurons of each layer can be grouped together into vectors z^(l), and the propagation of the input through the network can be expressed as matrix operations, where the connection weights from Equation (4) are organised into a matrix W^(l).

ẑ^(l) = W^(l) z^(l-1) + b^(l)    (5)

Such a layer is called fully connected, as every neuron in layer l-1 is connected to every neuron in layer l. The size of the matrix W^(l) can become intractable for large layer widths, for example when processing images. This issue can be mitigated by imposing some structure on the weights. For example, a convolutional layer assumes that different parts of the input will be processed similarly (much of W^(l) is redundant), such that Equation (5) can be replaced by a convolution operation. These types of layers have proven to be instrumental for networks that operate on images.

The activation function σ is chosen to be any nonlinear function that maps the real line to some interval, and is applied element-wise to the layer input ẑ^(l). Historically, σ has been selected as the sigmoid function, as this resembles the action potential exhibited in biological neurons. However, the sigmoid function is associated with the vanishing gradient problem in deep networks, making it difficult to train [4, 21]. A popular activation function that mitigates this issue is the Rectified Linear Unit (ReLU), a PWA function.

ReLU(x) = max(0, x)    (6)

Most neural network implementations generalise these networks as computational graphs that operate on higher-dimensional tensors (colour images can be represented with three dimensions, video with four, or five if sound is included), which greatly benefits the efficiency of evaluation and training. The different types of layers are thus generalised as operations on tensors, which may potentially be nonlinear. As previously mentioned, the scope of this work is limited to linear layers such as fully connected and convolutional layers, and to PWA activation functions.

Consider a network with fully connected layers, as in Fig. 1. The operation of each fully connected layer is an affine transformation, as shown in Equation (5). This can be converted to a linear transformation using Equation (3):

[ẑ^(l); 1] = [ W^(l)  b^(l); 0  1 ] [z^(l-1); 1] = T^(l) [z^(l-1); 1]    (7)

The last row of T^(l) has been added to keep the output in homogeneous coordinates. Relaxing the notation a little, the composition operator ∘ is allowed to operate on matrix multiplications such that (T^(2) ∘ T^(1)) x = T^(2) T^(1) x. A feedforward neural network with input x, output y, and no branches can then be expressed as a series of alternating matrix multiplications and applications of σ.

[y; 1] = ( T^(L) ∘ σ ∘ T^(L-1) ∘ σ ∘ … ∘ σ ∘ T^(1) ) [x; 1]    (8)

It is now easy to see that without nonlinear activation functions, this network would simplify into a single matrix multiplication. The nonlinearity of σ is thus crucial to the representational power of neural networks. Equation (8) yields useful insights when attempting to obtain the PWA form of a neural network. To motivate this, consider a network with L layers, each consisting of a single neuron with the following PWA activation function:

σ(s) = a_1 s + b_1 if s < 0,    a_2 s + b_2 if s ≥ 0    (9)

This activation function has two linear regions, separated by s = 0. The output of each fully connected layer is denoted s^(l), and the resulting activation is called z^(l) = σ(s^(l)). The output of the network can be written:

(10)

The activation function thus splits its input into 2 separate regions. This expression can be expanded recursively, showing that the previous activation also splits its input in two, doubling the number of cases.

(11)

Continuing the expansion doubles the number of cases at each layer, each case corresponding to the set of signs of all the s^(l). More generally, each s^(l) must lie in one of the intervals that the activation function is defined on, thus determining which affine piece of σ is applied. This active interval is defined as the activation of a neuron. The set of activations of all neurons in a network is called the activation pattern [24]. In the previous example, if the two intervals appearing in Equation (10) are labelled − and +, then an activation pattern could have the form (+, −, …, +).

The activation patterns are a natural way to characterise the linear regions of a neural network. Given some activation pattern, the corresponding case for each neuron can be selected (see Equation (9)), yielding the local PWA representation of the network. In terms of the previously introduced notation, this can be seen as substituting every application of σ in (8) with the corresponding linear transformation, allowing the whole chain to simplify into a single matrix multiplication:

(12)

However, not all activation patterns will be attainable by a given neural network. This can be seen in the next section in Fig. 3. The challenge is then to identify all the valid activation patterns, find the corresponding affine transformations, and assign to each of them the corresponding region of the input space. The situation is also complicated further when the layers are allowed to contain more than one neuron. The next section takes an iterative approach to this problem, by considering some simple examples that build on each other.
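The following Python sketch illustrates this idea on a small, randomly initialised ReLU network (an illustration with assumed layer sizes and notation, not the implementation described later): for a given input, the activation pattern is recorded and each ReLU is replaced by a diagonal 0/1 matrix that zeroes the inactive rows, collapsing the whole chain into one local affine map, as in Equation (12).

import numpy as np

rng = np.random.default_rng(1)

def homogeneous(W, b):
    # [W b; 0 1], as in Equation (7).
    return np.vstack([np.hstack([W, b[:, None]]),
                      np.hstack([np.zeros((1, W.shape[1])), [[1.0]]])])

# Assumed layer sizes 2 -> 4 -> 3 -> 1, with random weights.
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),
          (rng.normal(size=(3, 4)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
T = [homogeneous(W, b) for W, b in layers]

def local_affine_map(x):
    # Collapse the network into the single matrix valid on x's linear region.
    M = np.eye(len(x) + 1)
    z = np.append(x, 1.0)
    for k, Tk in enumerate(T):
        M, z = Tk @ M, Tk @ z
        if k < len(T) - 1:                   # ReLU after all but the last layer
            pattern = z[:-1] > 0             # the activation pattern at x
            S = np.diag(np.append(pattern.astype(float), 1.0))
            M, z = S @ M, S @ z              # zeroing rows = applying ReLU here
    return M

x = np.array([0.3, -1.2])
M = local_affine_map(x)

# The ordinary forward pass agrees with the collapsed affine map at x.
z = np.append(x, 1.0)
for k, Tk in enumerate(T):
    z = Tk @ z
    if k < len(T) - 1:
        z = np.append(np.maximum(z[:-1], 0.0), 1.0)
assert np.allclose(M @ np.append(x, 1.0), z)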


III PWA Representation of a Simple Neural Network with ReLU Activations

Consider a neural network with 2 inputs and 1 hidden layer with 3 nodes and ReLU activation, as shown in Fig. 2(a). The general form of the network is:

z = σ(W x + b),    W ∈ R^{3×2},  b ∈ R^3    (13)

The activation function is ReLU, as given in Equation (6). Equation (13) can then be written as:

z_i = max(0, w_i x + b_i),    i = 1, 2, 3    (14)

The vector w_i represents the i-th row of W. Each element z_i of z corresponds to the output of a neuron. Each neuron thus has two modes: one where the output is clipped to zero (because w_i x + b_i < 0), and one where the output is w_i x + b_i. The boundary between these two modes is given by w_i x + b_i = 0, which defines a line. More generally, this boundary will be a hyperplane when there are more than two inputs. Each neuron thus bisects the input space, only outputting a positive, nonzero value if the input point is on the positive side of the corresponding boundary. To visualise this directionality, the boundaries are drawn with a shaded side, as can be seen in Fig. 2.

Fig. 2: Each node with ReLU activation has two modes: one where it is active and one where it is inactive. Boundaries are therefore drawn with a shaded side representing the inactive side.

The arrangement of these boundaries defines a set of polyhedral regions in the input space, each corresponding to a different activation pattern. From (13) it can be seen that an inactive neuron is equivalent to setting the corresponding row of W (and the corresponding entry of b) to zero. The function in each region is therefore described by its own copy of the parameters, but with inactive rows being set to zero, as shown explicitly in Table I. Note that the activation function has been completely removed from the expression.

As there are no further layers, Table I is the complete PWA representation of the simple network in Fig. 2(a), where the regions are defined implicitly. Section IV describes how to find and how to explicitly represent the regions.

Activation pattern | Computed PWA function
TABLE I: The complete PWA representation for the network in Fig. 2(a). Each region computes its own transformation, with some of the rows zeroed out.

Adding an output layer with no activation is straightforward. The resulting network is shown in Fig. 2(b). The second layer has no activation, and computes the function:

(15)

where the transformation matrix corresponding to region j is the one shown in Table I. The effect of adding another layer with no activation is just a matrix multiplication between the matrix of the new layer and the matrix of each region j. This shows that adding layers without activation functions will not affect the linear regions.

(a) Single hidden layer
(b) Two hidden layers, no activation on the second
(c) Two hidden layers with activation
Fig. 3: The linear regions of three successively larger networks. Each neuron in the first hidden layer has a boundary in the form of a line. The activation patterns corresponding to each region are given as coloured dots, where the absence of a dot implies that a neuron is inactive. Adding a fully connected layer with no activations does not affect the linear regions, as shown in (b). Adding an activation function to a fully connected layer adds additional boundaries for each neuron, as shown in (c). Here there is only one neuron in the last layer. Its boundary is different in each linear region of the first layer, but remains continuous. The boundary thus appears to bend when intersecting with the boundaries of previous layers.

An activation function is now added to the last neuron of the network in Fig. 2(b). The ReLU function deactivates the node in certain regions, switching it on and off. As before, there is a boundary that describes this switching behaviour. However, this time the input space consists of multiple regions defined by the previous layers. Importantly, the parameter matrix for each region is different, implying that the boundary introduced by the last node will be different for each region. The result will be similar to what is depicted in Fig. 2(c).

The new boundary is continuous across the boundaries of the previous layers, but it will "bend" as it crosses them. This pattern continues as more layers are added, as new boundaries will bend when intersecting the boundaries of all previous layers. Another example of this structure is given in Fig. 4.

Fig. 4: Consider the activation boundaries of this new neural network. The boundaries of each neuron "bend" at the boundaries of previous layers.

Applying the ReLU nonlinearity is relatively simple. Every region found so far is associated with its own unique matrix, which defines the output of each neuron in that region. The procedure from the first example can then be applied separately to each region, yielding the new set of regions. The subregions inherit their parent's matrix, with the inactive rows zeroed out.

With this in place, the process of converting the network to its PWA form can be generalised to any number of layers. The missing piece is a method to find the regions defined by each layer.

IV Finding the Linear Regions of a Neural Network

The previous examples demonstrated how the PWA representation may be obtained when the activation pattern is known, and described the structure of the linear regions. What now remains is to explicitly compute the regions, which are polyhedra. It is most convenient to define the regions using the hyperplanes themselves. This is known as the H-representation, where the region is defined as the intersection of the half-spaces defined by the hyperplanes [8]. If the bounding hyperplanes have indices i ∈ I, then the polyhedron P can be written as:

P = { x ∈ R^d : a_i · x ≤ b_i  for all i ∈ I }    (16)

Alternatively, this can be written as the matrix inequality:

A x ≤ b    (17)

In particular, the matrix representation allows one to easily test whether a point is contained within the region, to compute an internal point by finding the Chebyshev centre, and to check for intersections between polytopes of different dimensions [2, 11]. A drawback is that there may be redundant constraints, which can slow down later operations. It is also generally expensive to identify and remove the redundant constraints, as this involves solving an LP for each hyperplane, although there exist heuristics to reduce this number [17].
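As an illustration of these operations (a sketch using scipy.optimize.linprog, not the MPT routines used in this work), a membership test and a Chebyshev centre computation for a region in H-representation can be written as follows; the example region is the unit square.

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])        # the unit square {x : A x <= b}

def contains(A, b, x, tol=1e-9):
    # Membership test: x lies in the region iff every inequality holds.
    return bool(np.all(A @ x <= b + tol))

def chebyshev_centre(A, b):
    # Largest inscribed ball: maximise r subject to A x + ||a_i|| r <= b.
    norms = np.linalg.norm(A, axis=1)
    c = np.zeros(A.shape[1] + 1)
    c[-1] = -1.0                           # linprog minimises, so maximise r
    res = linprog(c, A_ub=np.hstack([A, norms[:, None]]), b_ub=b,
                  bounds=[(None, None)] * len(c))
    return res.x[:-1], res.x[-1]           # internal point and ball radius

print(contains(A, b, np.array([0.5, 0.5])))    # True
print(chebyshev_centre(A, b))                  # centre (0.5, 0.5), radius 0.5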

The issue of finding the regions defined by the neuron boundaries of a layer with PWA activation is equivalent to finding the regions of a hyperplane arrangement. A compact and rigorous discussion of hyperplane arrangements is given in [26]. An important result is an upper bound on the maximal number of regions r(n, d) generated by n hyperplanes in R^d [35]:

r(n, d) ≤ Σ_{i=0}^{d} (n choose i)    (18)

This expression grows quickly with both n and d, but not exponentially. A surface plot of (18) is given in Fig. 5. In practice the number of regions will be significantly smaller than this.
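The bound in Equation (18) is cheap to evaluate; a short Python sketch:

from math import comb

def max_regions(n, d):
    # Upper bound on the number of regions of n hyperplanes in d dimensions.
    return sum(comb(n, i) for i in range(d + 1))

print(max_regions(3, 2))     # at most 7 regions for 3 lines in the plane
print(max_regions(20, 4))    # grows quickly, but only polynomially in n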

Fig. 5: Growth of Zaslavsky's upper bound for the number of regions in a hyperplane arrangement.

The regions may be found by iteratively bisecting a growing collection of regions, as illustrated in Fig. 6. This is done by adding the bisecting hyperplane as a constraint to the H-representation of the "parent" region. If the bisecting hyperplane is given by c · x = d, and the region is represented by A x ≤ b, then the constraint matrices of the two subregions will be:

[A; c] x ≤ [b; d]    and    [A; −c] x ≤ [b; −d]    (19)

If the hyperplane does not intersect the parent region, then one of these matrices describes an empty set. The intersection may be checked with a quick feasibility LP using Equation (17) as a constraint. The hyperplanes can then be considered one by one, checking for intersections with all of the regions found so far, and bisecting when there is an intersection.

The main source of complexity associated with this procedure is the increasing number of regions that must be checked for intersections. The search space can be reduced significantly by retaining the parent regions found after each iteration and checking these instead. If a hyperplane does not intersect a region, then it will not intersect any of its subregions either. This procedure is shown in Alg. 1.

fn get_regions(initial region R_0, hyperplanes H) is
         fn search(region R, hyperplane h) is
                  /* Extract A and b from R */
                  if h intersects R then
                           if R has no children then
                                    left_child(R) ← R ∩ {x : h(x) ≤ 0}
                                    right_child(R) ← R ∩ {x : h(x) ≥ 0}
                           else
                                    search(left_child(R), h)
                                    search(right_child(R), h)
                           end if
                  end if
         end
         foreach h ∈ H do
                  search(R_0, h)
         end foreach
         return the leaf regions of the tree rooted at R_0
end
Algorithm 1: Procedure to obtain the regions of a hyperplane arrangement
Fig. 6: Illustration of a procedure for finding the regions of a hyperplane arrangement. Each hyperplane is considered in turn, and is used to bisect the previously found regions by adding it to their H-representations. It is necessary to search the previously found regions for intersections. The search space can be significantly reduced by checking the parent regions first. This can be accomplished by storing the regions in a binary tree structure, adding new nodes every time a region is bisected by a hyperplane.
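A simplified Python sketch of this splitting procedure is given below (illustrative only: it scans a flat list of regions instead of the binary tree used in Algorithm 1, it does not distinguish a genuine bisection from a hyperplane that merely touches a face, and the helper names are assumptions rather than the authors' code).

import numpy as np
from scipy.optimize import linprog

def is_feasible(A, b):
    # Quick feasibility LP: does the region {x : A x <= b} contain a point?
    c = np.zeros(A.shape[1])
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * A.shape[1])
    return res.status == 0

def get_regions(A0, b0, hyperplanes):
    # Split {x : A0 x <= b0} by each hyperplane (c, d), meaning c . x = d.
    regions = [(A0, b0)]
    for c, d in hyperplanes:
        new_regions = []
        for A, b in regions:
            left = (np.vstack([A, c]), np.append(b, d))      # side c . x <= d
            right = (np.vstack([A, -c]), np.append(b, -d))   # side c . x >= d
            if is_feasible(*left) and is_feasible(*right):
                new_regions += [left, right]    # the hyperplane bisects the region
            else:
                new_regions.append((A, b))      # no intersection: keep the region
        regions = new_regions
    return regions

# Unit square split by its two diagonals (example data).
A0 = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b0 = np.array([1.0, 0.0, 1.0, 0.0])
hps = [(np.array([1.0, -1.0]), 0.0), (np.array([1.0, 1.0]), 1.0)]
print(len(get_regions(A0, b0, hps)))            # 4 regions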

V Algorithm

As shown in the examples, a neural network can be converted to its PWA representation in an iterative fashion, starting at the input layer. Each subsequent layer is then applied to each of the previously found regions. If the layer is a linear transformation (fully connected layer), the PWA transformation in each region is modified. If the layer is a ReLU activation function, each region is further subdivided by finding the node boundaries and solving the hyperplane arrangement problem.

The set of currently known regions and their corresponding matrices is called the working set. Every element of the working set is a tuple of the form (R, M), where R is a polyhedral region and M is a matrix that defines the affine transformation computed within that region.

The neural network itself is represented as a sequence of nodes, which can either be fully connected layers (represented as a linear transformation T in homogeneous coordinates, as in Equation (7)) or ReLU activations. The algorithm is presented in Alg. 2.

Result: the working set of pairs (region, matrix)
fn pwa(network N) is
         working set S ← {(input domain, identity matrix)}
         foreach layer l in N do
                  case l is a fully connected layer with transformation T do
                           for (R, M) in S do
                                    M ← T M
                           end for
                  case l is a ReLU activation do
                           S' ← ∅
                           for (R, M) in S do
                                    subregions ← get_regions(R, node boundaries given by the rows of M)
                                    for subregion R' in subregions do
                                             x ← interior_point(R')
                                             inactive rows ← rows i of M for which (M [x; 1])_i < 0
                                             M' ← M, with inactive rows set to zero
                                             add (R', M') to S'
                                    end for
                           end for
                           S ← S'
                  end case
         end foreach
         return S
end
Algorithm 2: Convert a neural network to its PWA representation
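The main loop of Algorithm 2 can be sketched in Python as follows (an illustration, not the MATLAB/MPT implementation), assuming a bounded polyhedral input domain and reusing the get_regions and chebyshev_centre helpers sketched earlier; layer matrices are given in the homogeneous form of Equation (7).

import numpy as np

def pwa(layers, A0, b0):
    # layers: list of ('linear', T) or ('relu',); T is a homogeneous layer matrix.
    working = [((A0, b0), np.eye(A0.shape[1] + 1))]      # pairs (region, matrix)
    for layer in layers:
        if layer[0] == 'linear':
            T = layer[1]
            working = [(R, T @ M) for R, M in working]
        else:                                            # ReLU activation
            new_working = []
            for (A, b), M in working:
                # Row m of M defines the boundary m . [x; 1] = 0, i.e. the
                # hyperplane m[:-1] . x = -m[-1]; all-zero rows are skipped.
                hps = [(m[:-1], -m[-1]) for m in M[:-1]
                       if np.linalg.norm(m[:-1]) > 1e-12]
                for Ar, br in get_regions(A, b, hps):
                    x, _ = chebyshev_centre(Ar, br)      # interior point
                    z = M @ np.append(x, 1.0)
                    active = np.append((z[:-1] > 0).astype(float), 1.0)
                    new_working.append(((Ar, br), np.diag(active) @ M))
            working = new_working
    return working

The returned list pairs each polyhedral region of the input domain with the matrix of the affine map the network computes there.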

As the size of the working set will increase after processing each layer, it is clear that the worst case performance of the algorithm will depend greatly on the total depth of the network. However, it is not clear how quickly the working set will grow. For example, some regions in the working set may be intersected multiple times by the node boundaries in the next layer, while others will not be intersected at all. Despite this, the problem is inherently parallelisable. When parsing a layer of the network, the hyperplane arrangement problem is solved separately for each region in the working set, allowing for significant speedups when many cores are available. As will be shown in the next section, the number of regions in the working set quickly becomes very large, suggesting that the algorithm could benefit greatly from GPU hardware acceleration.
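Since the per-region splits are independent, they can be dispatched to a process pool; a sketch (assuming the get_regions helper above, not the benchmarked implementation):

from concurrent.futures import ProcessPoolExecutor

def split_one(job):
    # One independent hyperplane-arrangement problem for a single region.
    (A, b), hyperplanes = job
    return get_regions(A, b, hyperplanes)

def split_all(regions, hyperplanes_per_region, workers=6):
    jobs = list(zip(regions, hyperplanes_per_region))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [r for sub in pool.map(split_one, jobs) for r in sub]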

VI Results

All runtimes were measured using a machine with a 6-core, 3.5 GHz processor and 16 GB of RAM. The polyhedral computations described in previous sections were performed using the MPT toolbox for MATLAB©. The runtimes of get_regions() in terms of the number of hyperplanes are presented in Fig. 8(a). The runtimes for get_regions() are also presented in terms of the number of regions in Fig. 8(b), showing that the runtime is roughly proportional to the number of regions found. More surprisingly, the effect of increasing the input dimension (and thus the size of the required LPs) is almost negligible in comparison. This suggests that it is the high number of calls to the LP solver, rather than the size of the LPs, that dominates the time complexity of get_regions(). As the LPs are generally quite small, choosing an LP solver with a low amount of presolving might yield significant improvements. The authors' implementation used the default LP solver included with MATLAB© (linprog()), which is often outperformed by other solvers.

The runtime of the main algorithm was measured with and without parallelisation on the available 6 cores. The runtime as a function of the number of regions of the final network is shown in Fig. 7. Networks with an input dimension of up to four were processed, as the number of regions quickly exploded and the runtimes became intractable for networks with larger input dimensions. The runtime increases exponentially with the size of the network. Parallelisation was very effective, with the performance increasing by a factor approaching the number of cores used (6 cores). The per-region cost decreases with the input dimension, suggesting that there is an efficiency gain when increasing the input dimension. However, for any given number of regions, the corresponding points on the lines represent networks of very different sizes. For example, a network with two inputs and three hidden layers with width 10 might have a similar number of regions as a network with four inputs and two hidden layers with width 5. However, the first network will likely take longer to convert due to it having an additional hidden layer, and wider layers overall.

Fig. 7: Runtime of the main algorithm against the number of regions found, with and without parallelisation.
(a) Runtime of get_regions() against the number of hyperplanes
(b) Runtime of get_regions() against the number of regions
Fig. 8: The runtime of get_regions() increases significantly with the number of nodes/hyperplanes, and appears to be somewhat linear with respect to the number of regions found. Increasing the input dimension increases the number of regions significantly, and adds a small, per-region cost that scales with the size of the arrangement.
Fig. 9: True and simulated trajectory using the trained neural network. The network displays some asymmetries in its trajectory, suggesting that the learned pendulum would swing slightly higher on one side. It also appears to converge slightly off-centre of the origin. This is due to the fact that the neural network does not assume energy conservation.

As previously mentioned, PWA functions are widely used to represent complex dynamical systems. Neural networks are not as commonly used due to the difficulty of reasoning about their behaviour. However, it is possible to train a neural network on dynamical data and then retrieve its PWA form. The algorithm is now applied to a neural network with 2 inputs, so that each of its outputs can be plotted separately as a surface. The linear regions of the network can then also be plotted in the plane. The neural network was given 2 hidden layers, with 15 and 5 neurons respectively. The network was trained on the following mapping, which describes the dynamics of a damped pendulum with parameter values 9.81, 1, 5, and 0.1.

(20)

This can be reformulated as a system of first order ordinary differential equations (ODE), where the states are taken to be the pendulum angle and its angular velocity:

(21)
(a) Linear regions (116 total) of the pendulum neural network
(b) First output of the network
(c) Second output of the network
Fig. 10: The complete PWA form of the pendulum dynamics neural network. The linear regions appear to have arranged themselves in patterns that support the shape of the output.

A training dataset was created by sampling the angle and the angular velocity 50000 times from a continuous uniform distribution and a normal distribution respectively, creating a sample of states. The corresponding state derivatives were then found through Equation (21). The neural network was then trained on the data using the Adam (derived from "adaptive moment estimation") optimiser with a learning rate of 0.003 for 50 epochs, finally achieving a low root mean square error (RMSE). The true and learned dynamics were then simulated using the MATLAB© function ode45(). Fig. 9 compares the two. The complete PWA form of the network is shown in Fig. 10 as a pair of surface plots, along with the 116 linear regions. Interestingly, the linear regions show a concentration of horizontal boundaries, along with several vertical boundaries. Because the network is locally linear, the boundaries determine any changes in gradient. It is therefore likely that the concentration of horizontal boundaries serves to give the two outputs a constant slope in one direction (see Fig. 10(b) and 10(c)). Likewise, the vertical boundaries form large sheets that are arranged in a sinusoidal shape that approximates Equation (20). It is interesting to see such structure emerging as a result of the training process. However, there is still a bit of irregularity due to the large number of small regions between closely packed boundaries. These small regions are numerous, but highly redundant, as they do not contribute significantly to the shape of the output. This can prove to be challenging when attempting to analyse the stability of such a representation, which typically involves keeping track of possible state transitions between regions, for example when using energy methods [18]. It may therefore be desirable to take steps to simplify the PWA representation either during or after the training process by merging boundaries that appear redundant, or by introducing new boundaries. This could be done by adding some kind of regularisation that forces similar connection weights for neurons in the same layer to converge together. The architecture of a network could then be simplified by merging neurons with very similar weights. Likewise, if the network is performing badly in a particular region of the state space, neurons can be split in two, introducing additional boundaries.
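For illustration, a rough PyTorch sketch of such a setup is given below. The pendulum parametrisation, sampling ranges, and training details here are assumptions standing in for the exact values used above, and this is not the authors' MATLAB workflow.

import math
import torch
import torch.nn as nn

g, l, m, d = 9.81, 1.0, 5.0, 0.1          # assumed assignment of the constants

def dynamics(theta, omega):
    # Assumed first-order form of a damped pendulum:
    # d(theta)/dt = omega, d(omega)/dt = -(g/l) sin(theta) - (d/m) omega.
    return omega, -(g / l) * torch.sin(theta) - (d / m) * omega

theta = 2 * math.pi * torch.rand(50_000) - math.pi     # sampled states
omega = torch.randn(50_000)
X = torch.stack([theta, omega], dim=1)
Y = torch.stack(dynamics(theta, omega), dim=1)         # state derivatives

# 2 inputs, hidden layers of 15 and 5 ReLU neurons, 2 outputs (as in the text).
net = nn.Sequential(nn.Linear(2, 15), nn.ReLU(),
                    nn.Linear(15, 5), nn.ReLU(),
                    nn.Linear(5, 2))

opt = torch.optim.Adam(net.parameters(), lr=3e-3)
for epoch in range(50):                   # full-batch training for simplicity
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), Y)
    loss.backward()
    opt.step()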

VII Conclusion

A reasonably efficient algorithm that can obtain the PWA representation of a neural network using ReLU activation functions was presented. Results demonstrating conversions of randomly initialised neural networks with up to four input dimensions and three layers were reported, the largest of which had 31835 linear regions. A parallelised version of the algorithm was able to perform this conversion in around a minute on a standard desktop computer. With more computational resources, as well as further optimisations to the algorithm, much larger networks should be convertible. While this paper demonstrated how to perform the conversion for networks with fully connected layers and ReLU activations only, the approach may be generalised to any linear layer and arbitrary PWA activation functions (for example, leaky ReLU). This includes convolutional layers, normalisation layers, and networks with more complex branching architectures, which encompasses a large fraction of popular architectures in use today. The input dimension of the network appears to be a large source of complexity, limiting this approach to networks with fewer inputs. Using the method together with dimensionality reduction techniques shows great promise for the study of complex systems that resist analysis.

References

  • [1] H. Amin, K. M. Curtis, and B. R. Hayes-Gill (1997) Piecewise linear approximation applied to nonlinear function of a neural network. IEE Proceedings-Circuits, Devices and Systems 144 (6), pp. 313–317. Cited by: §I.
  • [2] M. Baotic (2009) Polytopic computations in constrained optimal control. Automatika 50 (3-4), pp. 119–134. Cited by: §IV.
  • [3] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian (2019) Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks.. Journal of Machine Learning Research 20 (63), pp. 1–17. Cited by: §I.
  • [4] Y. Bengio, P. Simard, P. Frasconi, et al. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §II.
  • [5] H. Benlaoukli, M. Hovd, S. Olaru, and P. Boucher (2009-12) On the construction of invariant sets for piecewise affine systems using the transition graph. In 2009 IEEE International Conference on Control and Automation, Vol. , pp. 122–127. External Links: Document, ISSN Cited by: §I.
  • [6] A. L. Cervantes, O. E. Agamennoni, and J. L. Figueroa (2003) A nonlinear model predictive control system based on wiener piecewise linear models. Journal of process control 13 (7), pp. 655–666. Cited by: §I.
  • [7] R. Eldan and O. Shamir (2015) The power of depth for feedforward neural networks. CoRR abs/1512.03965. External Links: Link, 1512.03965 Cited by: §I, §I, §II.
  • [8] K. Fukuda et al. (2004) Frequently asked questions in polyhedral computation. Swiss Federal Institute of Technology. Cited by: §IV.
  • [9] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §I.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I.
  • [11] M. Herceg, M. Kvasnica, C.N. Jones, and M. Morari (2013-July 17–19) Multi-Parametric Toolbox 3.0. In Proc. of the European Control Conference, Zürich, Switzerland, pp. 502–510. Note: http://control.ee.ethz.ch/~mpt Cited by: §I, §II, §IV.
  • [12] K. Kannadasan, D. R. Edla, and V. Kuppili (2018) Type 2 diabetes data classification using stacked autoencoders in deep neural networks. Clinical Epidemiology and Global Health. External Links: ISSN 2213-3984 Cited by: §I.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Link Cited by: §I.
  • [14] H. Liu, Y. Tsao, and C. Fuh (2018) Bone-conducted speech enhancement using deep denoising autoencoder. Speech Communication 104, pp. 106–112. External Links: ISSN 0167-6393 Cited by: §I.
  • [15] J. Liu, S. Wang, and W. Yang (2019) Sparse autoencoder for social image understanding. Neurocomputing. External Links: ISSN 0925-2312 Cited by: §I.
  • [16] K. G. Lore, A. Akintayo, and S. Sarkar (2017) LLNet: a deep autoencoder approach to natural low-light image enhancement. Pattern Recognition 61, pp. 650 – 662. External Links: ISSN 0031-3203 Cited by: §I.
  • [17] A. Maréchal and M. Périn (2017) Efficient elimination of redundancies in polyhedra by raytracing. In International Conference on Verification, Model Checking, and Abstract Interpretation, pp. 367–385. Cited by: §IV.
  • [18] D. Mignone, G. Ferrari-Trecate, and M. Morari (2000-12) Stability and stabilization of piecewise affine and hybrid systems: an lmi approach. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No.00CH37187), Vol. 1, pp. 504–509 vol.1. External Links: Document, ISSN Cited by: §I, §VI.
  • [19] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932. Cited by: §I.
  • [20] G. Montúfar (2017) Notes on the number of linear regions of deep neural networks. In SampTA, Cited by: §I.
  • [21] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. Cited by: §II.
  • [22] R. Pascanu, G. Montufar, and Y. Bengio (2013) On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098. Cited by: §I.
  • [23] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein (2017) On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 2847–2854. External Links: Link Cited by: §I.
  • [24] T. Serra, C. Tjandraatmadja, and S. Ramalingam (2018) Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pp. 4565–4573. Cited by: §I, §II.
  • [25] H. Shao, H. Jiang, H. Zhao, and F. Wang (2017) A novel deep autoencoder feature learning method for rotating machinery fault diagnosis. Mechanical Systems and Signal Processing 95, pp. 187 – 204. External Links: ISSN 0888-3270, Document, Link Cited by: §I.
  • [26] R. P. Stanley (2006) An introduction to hyperplane arrangements. University of Pennsylvania. Cited by: §IV.
  • [27] G. Strang (2019) Linear algebra and learning from data. Wellesley-Cambridge Press. External Links: ISBN 9780692196380, Link Cited by: §I.
  • [28] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, pp. 1–1. External Links: Document, ISSN Cited by: §I.
  • [29] C. Szegedy, S. Ioffe, and V. Vanhoucke (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261. External Links: Link, 1602.07261 Cited by: §I.
  • [30] M. Telgarsky (2015) Representation benefits of deep feedforward networks. CoRR abs/1509.08101. External Links: Link, 1509.08101 Cited by: §I.
  • [31] A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik (2019) Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Generation Computer Systems 96, pp. 386 – 397. External Links: ISSN 0167-739X Cited by: §I.
  • [32] V. Vapnik and A. Chervonenkis (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications 16 (2), pp. 264–280. External Links: Document, Link, https://doi.org/10.1137/1116025 Cited by: §I.
  • [33] S. Wang, X. Huang, and K. M. Junaid (2008) Configuration of continuous piecewise-linear neural networks. IEEE Transactions on Neural Networks 19 (8), pp. 1431–1445. Cited by: §I.
  • [34] J. Yu, X. Zheng, and S. Wang (2019) A deep autoencoder feature learning method for process pattern recognition. Journal of Process Control 79, pp. 1 – 15. External Links: ISSN 0959-1524 Cited by: §I.
  • [35] T. Zaslavsky (1975-01) Facing up to arrangements: face-count formulas for partitions of space by hyperplanes. Memoirs of the American Mathematical Society 154. External Links: Document Cited by: §IV.