The success of Deep Learning (Bengio et al., 2021) is enabled by the ability of neural networks to learn abstract concepts from complex data. Once learned, these abstract concepts can simplify the solution of a wide variety of problems. Often, these concepts correspond to unobservable or even unknown quantities. Famous examples are BERT (Devlin et al., 2019) for natural language processing problems and ResNet (He et al., 2016) for computer vision tasks. However, concept learning and transfer learning are hardly used for cyber-physical systems (CPS).
CPS are characterized by the interaction between a computational entity and a physical process (Muhuri et al., 2019). In most cases, CPS data are available in the form of multivariate time series. If deep learning models were available that could extract or identify physical concepts from this sensor data, simpler solutions could be found for tasks such as predictive maintenance (Nguyen and Medjaher, 2019; Niggemann and Frey, 2015) or diagnosis (Zhang et al., 2019). Examples of such physical concepts would answer the following questions: Can the large number of sensors be represented by a small number of variables that describe the state of the system? If so, how can these variables be interpreted? And can the behavior of the CPS over time be described in a simple way? These questions motivate this paper and lead to the following research questions: (i) Can the benefit of representation learning (RepL) methods be demonstrated on CPS examples? What are useful concepts for CPS models? How are concepts defined for time series? (ii) What are the advantages and disadvantages of these methods, and to which CPS use cases can they be applied?
The remainder of this paper is organized as follows: In Section 2, we provide an overview of the current state of research related to solving research question (i). Based on this, in Section 3 we present and discuss selected methods that we consider promising. For this purpose, we draw on the often-used example of the three-tank system (see Figure 1), which is well suited to answer research question (ii) for two reasons: First, the complexity is low and the concepts are therefore relatively easy to understand. Second, it can be easily extended incrementally in complexity up to the Tennessee Eastman Process (Balzereit et al., 2021). These example implementations are then used to discuss the usability as well as the advantages and disadvantages of the respective algorithms in Section 4. For this, we evaluate the methods with respect to their applicability to typical CPS datasets, the interpretability of the respective results, and the amount of prior knowledge required. In addition, we consider the application of the methods for typical CPS use cases. Finally, Section 5 concludes this paper.
2 State of the Art
The task of learning physical concepts using ML methods can be seen as a subset of the very active and relatively new research area of RepL. The core motivation of RepL is to build models that are capable of encoding real-world observations of (physical) processes into meaningful representations (Bengio et al., 2013). Good representations both increase transparency and simplify downstream tasks such as predictions or classifications. Often, this means that a multivariate time series $X \in \mathbb{R}^{T \times n}$, where $T$ denotes the sequence length and $n$ the number of sensors, is mapped to a vector $z \in \mathbb{R}^d$ such that $z$ encodes the disentangled explanatory factors of the observation $X$. In most cases, computing representations is accompanied by a dimensionality reduction. The idea of dimensionality reduction is far from new: methods such as PCA (Hotelling, 1933) and MDS (Kruskal, 1964) have been used successfully for decades. Newer methods for dimensionality reduction, such as Stochastic Neighbor Embedding (Hinton and Roweis, 2002), t-SNE (van der Maaten and Hinton, 2008), and UMAP (McInnes et al., 2018), are also frequently applied in the field of CPS.
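As an illustration of this mapping, here is a minimal PCA sketch via SVD in NumPy; the toy data, dimensions, and function name are our own illustrative choices, not taken from any of the cited works.

```python
import numpy as np

# Minimal sketch: dimensionality reduction of a multivariate time series
# with PCA via SVD. The series X has T time steps and n sensors; each time
# step is projected onto d < n principal components.
def pca_reduce(X, d):
    """Project the rows of X onto the first d principal components."""
    Xc = X - X.mean(axis=0)                # center each sensor channel
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:d].T                      # latent representation, shape (T, d)
    explained = (S[:d] ** 2).sum() / (S ** 2).sum()
    return Z, explained

# Toy example: three "sensors" that are noisy copies of one underlying
# signal, so a single component should explain almost all variance.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
source = np.sin(t)
X = np.stack([source + 0.01 * rng.standard_normal(t.size) for _ in range(3)], axis=1)
Z, explained = pca_reduce(X, d=1)
print(Z.shape, explained > 0.95)  # (500, 1) True
```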
However, these methods seldom encode actually meaningful representations, and even more rarely physical concepts. For this purpose, strong priors are required. In modern RepL, several of these priors are used and studied, see (Bengio et al., 2013). A common prior is to assume simple and sparse dependencies between a small number of underlying factors that explain the variation in the observed data. Probably the best-known example is the assumption that the underlying explanatory factors are statistically independent. This prior lies at the core of deep generative models such as Variational Autoencoders (VAEs) (Kingma and Welling, 2013) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). VAEs and GANs have been supplemented and improved in many ways over the past few years (Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018, 2016; Zhao et al., 2019). While corresponding methods have achieved impressive success in RepL for high-dimensional data such as images or videos, stronger priors are required for identifying physical concepts, especially in complex CPS datasets.
Within the field of RepL, there have been a number of works trying to learn discrete representations. Popular examples include restricted Boltzmann machines (Salakhutdinov and Hinton, 2009) and VQ-VAE (van den Oord et al., 2017). Discrete representations are a good fit for CPS, as the system's behavior can often be described by underlying states. Furthermore, discrete representations are more interpretable and thus simplify supervision. By applying these methods to time series, the dynamic behavior of the underlying system can be analyzed, as demonstrated by SOM-VAE (Fortuin et al., 2019). Successful applications on CPS include the use of self-organizing maps for predictive maintenance (von Birgelen et al., 2018) and restricted Boltzmann machines for anomaly detection (Hranisavljevic et al., 2020).
An approach that incorporates particularly strong priors for learning physical concepts is introduced by Nautrup et al. (2020). The authors present a method that contains a number of decoder and encoder neural networks (called agents) that exchange information among each other. While the encoding agents map the observations into a representation, the decoding agents perform different subtasks using the representations as input. These subtasks are chosen such that each decoding agent requires a different subset of underlying explanatory factors to perform its task. By means of a special loss function, the communication between the agents is minimized, which disentangles the individual variables in the latent space.
The literature also provides ML approaches that aim at extracting physical concepts in the sense of simple and sparse symbolic formulas. An important example is Symbolic Regression (SR), where the aim is to find a symbolic expression that maps some given input to some output data. A very popular solution is described in (Schmidt and Lipson, 2009) and implemented in the Eureqa software. More recently, a new method called AI Feynman was introduced (Udrescu et al., 2020; Udrescu and Tegmark, 2020). A use case for SR, which is very promising especially in the context of CPS, is the discovery of dynamic systems. For example, Brunton et al. (2016) propose a method called SINDy, which uses SR methods to discover parsimonious models for non-linear dynamic systems.
An important assumption of the SR-based methods mentioned above, however, is that the sparse symbolic expression can be found in the coordinate system in which the observations are measured. A new field of ML research is emerging around the question of how suitable coordinate systems and dynamic systems can be learned simultaneously. Champion et al. (2019), e.g., use an autoencoder architecture in combination with the SINDy algorithm. The autoencoder maps the observations from some high-dimensional space into a coordinate system that enables a parsimonious representation of the system dynamics in the latent space. Because linear dynamic systems are of great advantage over non-linear systems for control engineering and prediction, there are also approaches that allow the identification of linear dynamic models in latent space (Lusch et al., 2018). This approach is closely related to Koopman operator theory. A recent and comprehensive literature review on this topic is provided by Brunton et al. (2021).
To the best of our knowledge, there is no paper systematically evaluating the application of current deep RepL methods in the field of CPS.
3 Selected Methods
In this section, we analyze the application of a selection of the methods mentioned in Section 2 in the field of CPS. For this purpose, we consider the example of the three-tank system (see Figure 1). The dynamics of this system with respect to the fill levels of the tanks can be described as follows (Kubalčík and Bobál, 2016):

$$\dot{h}_1 = \frac{1}{A}\left(q_1 - c_1\,\mathrm{sgn}(h_1 - h_2)\sqrt{|h_1 - h_2|}\right)$$
$$\dot{h}_2 = \frac{1}{A}\left(c_1\,\mathrm{sgn}(h_1 - h_2)\sqrt{|h_1 - h_2|} + c_2\,\mathrm{sgn}(h_3 - h_2)\sqrt{|h_3 - h_2|} - c_3\sqrt{h_2}\right) \tag{1}$$
$$\dot{h}_3 = \frac{1}{A}\left(q_2 - c_2\,\mathrm{sgn}(h_3 - h_2)\sqrt{|h_3 - h_2|}\right)$$

where $h_1$, $h_2$ and $h_3$ are the time-dependent fill levels of the corresponding tanks, $q_1$ and $q_2$ are the flow rates of the pumps, $c_1$, $c_2$ and $c_3$ are the coefficients of the valves $V_1$, $V_2$ and $V_3$ respectively, and $A$ is some system-specific constant. Ideally, the methods would identify explanatory but unobservable physical quantities such as the inflow or valve coefficients based on observations of this system. Possibly, even the system dynamics or process phases like "mixing" or "filling" could be identified.
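Such a system can be simulated in a few lines. The sketch below assumes a serial topology (pumps feeding Tanks 1 and 3, Tank 2 in the middle, valve 3 draining Tank 2) and explicit Euler integration; all coefficient ranges, the step size, and the function name are illustrative assumptions, not the paper's actual simulation code.

```python
import numpy as np

def three_tank_step(h, q1, q2, c, A=1.0, dt=0.1):
    """One explicit Euler step of an (assumed) three-tank model: pump 1
    feeds Tank 1, pump 2 feeds Tank 3, Tank 2 sits in between, and valve 3
    drains Tank 2."""
    h1, h2, h3 = h
    c1, c2, c3 = c
    f12 = c1 * np.sign(h1 - h2) * np.sqrt(abs(h1 - h2))  # flow Tank 1 <-> 2
    f32 = c2 * np.sign(h3 - h2) * np.sqrt(abs(h3 - h2))  # flow Tank 3 <-> 2
    dh1 = (q1 - f12) / A
    dh2 = (f12 + f32 - c3 * np.sqrt(max(h2, 0.0))) / A
    dh3 = (q2 - f32) / A
    h_next = np.array([h1 + dt * dh1, h2 + dt * dh2, h3 + dt * dh3])
    return np.clip(h_next, 0.0, None)  # fill levels cannot become negative

# One trajectory with randomly sampled inflows and valve coefficients,
# mirroring how the dataset variations are described in the text.
rng = np.random.default_rng(1)
h = rng.uniform(0.2, 1.0, size=3)
q1, q2 = rng.uniform(0.0, 0.2, size=2)
c = rng.uniform(0.1, 0.5, size=3)
traj = [h]
for _ in range(199):
    h = three_tank_step(h, q1, q2, c)
    traj.append(h)
traj = np.array(traj)
print(traj.shape)  # (200, 3)
```

Sampling many such trajectories with different inflows and valve coefficients yields datasets of the kind used in the experiments below.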
3.1 Solution 1: Seq2Seq variational autoencoder
As mentioned in Section 2, deep generative models such as VAEs and GANs can be trained to extract the underlying factors of variation in data. These models often approximate the joint probability $p(x, z)$, where $x$ represents the observations and $z$ the latent space variables, which encode the disentangled explanatory factors of variation in $x$. The core assumption in this approach is that the underlying factors of variation in the data also describe the underlying physical concepts causing the observations. In this case, the physical concepts would be encoded in the form of the latent variables $z$.
To demonstrate this method, we simulate a dataset based on the dynamics described in Equation (1). The dataset contains different time series of length $T$, each describing an independent process of the three-tank system. The individual time series differ in the values of the inflows $q_1$ and $q_2$ as well as in the valve coefficients $c_1$ and $c_2$. We sample these quantities randomly from a given interval. Figure 2 shows two example time series from this dataset.
Using this dataset, we train a β-VAE (Higgins et al., 2017). We have adapted the solution from the original paper to better handle time series data by using Gated Recurrent Units (GRUs) (Cho et al., 2014) in the encoder and decoder with parameters $\phi$ and $\theta$ respectively (see Figure 3). Given the continuous nature of the concepts we are interested in, we choose a Gaussian prior such that $p(z) = \mathcal{N}(0, I)$ and $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)))$.
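The objective such a model optimizes can be sketched as follows: the standard β-VAE loss, i.e. reconstruction error plus a β-weighted closed-form KL divergence between the diagonal-Gaussian posterior and the standard normal prior. Shapes and values below are chosen purely for illustration.

```python
import numpy as np

# Sketch of the beta-VAE training objective: reconstruction error plus a
# beta-weighted KL divergence between the approximate posterior
# q(z|x) = N(mu, diag(sigma^2)) and the standard normal prior p(z) = N(0, I).
def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims.
    kl = 0.5 * np.mean(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1))
    return recon + beta * kl

# If the posterior matches the prior exactly and reconstruction is perfect,
# the loss vanishes.
x = np.ones((8, 3))
zero_loss = beta_vae_loss(x, x, mu=np.zeros((8, 5)), log_var=np.zeros((8, 5)))
print(zero_loss)  # 0.0
```

Larger values of β put more weight on matching the prior, which is what encourages disentangled (statistically independent) latent factors.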
Evaluating the quality of representations is not trivial in general. In our example, however, we are able to compare the mappings learned by the model with the actual physical factors underlying the data. In the case of good representations in our sense, a strong correlation should be observed between one of the estimated mean parameters $\mu_i$ of the latent distributions and one of the actual underlying quantities $q_1$, $q_2$, $c_1$ and $c_2$. In other words, each of the physical concepts should be reflected in the activation of one of the latent neurons. At the same time, this activation should be as indifferent as possible to changes in the other physical quantities. This can be seen to some extent in Figure 4.
We can observe that one of the five latent variables encodes almost no information from the input. The distribution of this variable is very close to the prior distribution $\mathcal{N}(0, 1)$. This makes sense, since only four independent variables cause the changes in the data. A clear correlation can be seen in three of the subplots, which show the scatter plots of pairs of latent mean activations and underlying physical quantities; this leads to the assumption that these latent variables encode the respective physical concepts to some extent. However, no clear disentanglement of the individual concepts emerges, although the corresponding conditional likelihood $p_\theta(x \mid z)$ is very high.
As this experiment shows, deep generative models can be trained in a purely data-driven manner to learn representations of time series data. The fact that the prior knowledge required for training is limited to the choice of the a priori distribution and the number of latent neurons is a clear advantage of this method. Another advantage is that this method can also be applied to high-dimensional time series data, which are often available in the CPS environment. A key disadvantage of this approach, however, is that it is not specifically designed to identify physical concepts, but rather to identify those factors that explain the variation in the observations. If, for example, the behavior of a CPS was primarily dependent on its control variables, it would be these control variables that would have been captured in the latent space variables. Another disadvantage of this approach is that the concepts learned are limited to physical quantities. Concepts such as event sequences or formulas cannot be identified.
3.2 Solution 2: Communicating agents
As mentioned in Section 2, in most cases strong priors are needed to extract actual physical concepts from CPS data. In this section, we demonstrate and validate a method introduced by Nautrup et al. (2020). Their solution essentially consists of three components (see Figure 5): (i) There is (at least) one encoder that maps the observations $x \in \mathbb{R}^n$ into a lower-dimensional latent space $z \in \mathbb{R}^d$ with $d < n$ using a deep neural network. (ii) There are several decoders that solve different tasks using the latent representation of the observations as input. The authors call these tasks "questions". These could, for example, be regression problems. (iii) There is a special filter component, which limits the amount of information transferred from the encoder to each individual decoder. Intuitively, the filter can be thought of as adding random noise to the latent space variables before transferring them to the decoders. The loss function maximizes the amount of noise added, while minimizing the mean squared error of the regression problems solved by the decoders.
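The filter intuition can be sketched numerically. The noise parameterization, loss weight, and toy decoder below are our own illustrative assumptions, not the original implementation: a latent dimension the decoder does not need tolerates arbitrarily large noise, so maximizing the noise scales while keeping the decoder accurate isolates the dimensions each decoder actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def filtered_latent(z, noise_scales):
    """Add per-dimension noise to the latent vectors before decoding."""
    return z + noise_scales * rng.standard_normal(z.shape)

def agent_loss(z, noise_scales, decoder, targets, lam=0.1):
    """Decoder MSE minus a reward for large noise scales (i.e. for
    transmitting as little information per latent dimension as possible)."""
    answers = decoder(filtered_latent(z, noise_scales))
    mse = np.mean((answers - targets) ** 2)
    return mse - lam * np.sum(np.log(noise_scales))

# Toy decoder that only uses the first latent dimension: full noise on the
# other two dimensions leaves its answers (and hence the MSE) unchanged.
z = rng.standard_normal((64, 3))
decoder = lambda z: z[:, :1] * 2.0
targets = z[:, :1] * 2.0
loss_noisy_rest = agent_loss(z, np.array([1e-3, 1.0, 1.0]), decoder, targets)
print(round(loss_noisy_rest, 2))
```

In the full method this trade-off is optimized jointly over several decoders, each needing a different subset of the latent dimensions, which is what disentangles the representation.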
To generate data for this experiment, we use the same simulation as in Subsection 3.1. However, training this method also requires the dataset to contain what the authors call "questions" and "answers", which are essentially labels. Training in this constellation only helps to disentangle the latent variables if different physical concepts are required to answer the different questions. For our experiment, we assume that it is possible to answer the following questions for each training sample: Given some flow rate $q_1$ (or $q_2$), how long does it take to fill up Tank 1 (or Tank 3) when all the valves are closed ($c_1 = c_2 = c_3 = 0$)? And if Tank 1 (or Tank 2) is completely filled and all valves are opened, how long will it take for the system to drain? To generate the complete dataset, we compute the answers to these four questions in advance. In contrast to the authors of the original paper, we use GRUs in the encoder neural network.
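Assuming idealized tank dynamics (constant inflow while all valves are closed, and a Torricelli-style outflow $c\sqrt{h}$ while draining), such answers can be computed in closed form. The function names and parameter values below are illustrative, not the paper's actual label-generation code.

```python
import numpy as np

def fill_time(h_max, q, A=1.0):
    """Time to fill a tank of cross-section A to level h_max with constant
    inflow q and all valves closed: dh/dt = q / A  =>  t = A * h_max / q."""
    return A * h_max / q

def drain_time(h0, c, A=1.0):
    """Time to drain a tank from level h0 through a valve with coefficient c:
    dh/dt = -(c / A) * sqrt(h)  =>  t = 2 * A * sqrt(h0) / c."""
    return 2.0 * A * np.sqrt(h0) / c

print(fill_time(h_max=1.0, q=0.25))  # 4.0
print(drain_time(h0=1.0, c=0.5))     # 4.0
```

Because the fill-time answers depend only on the inflows and the drain-time answers only on the valve coefficients, each question isolates a different subset of the underlying physical quantities, which is exactly what the method needs.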
As in the example from Subsection 3.1, we can compare the learned representations with the actual underlying physical quantities. In Figure 6, we see that the correlations between the activations of the latent space neurons and the physical concepts are much stronger in this experiment than in the previous one. This result was to be expected, because the dataset enriched with questions and answers imposes a stronger prior towards disentangling the latent variables.
The experiment shows that the method produces a better disentanglement between the individual concepts underlying the data than the VAE. Thus, the possibility of incorporating prior knowledge through the design of experiments, including questions and answers, has advantages. However, the need for datasets that include tuples of questions and answers in addition to the observations is also the major drawback of this approach, as performing experiments in CPS is rarely feasible. In addition, as with the VAE, only physical quantities can be identified, but not their interaction or temporal sequences.
3.3 Solution 3: Dynamic system identification
In this section, we use the SINDy and the Autoencoder-SINDy methods to show how the system dynamics of our example system can be identified. For both methods, in addition to the observations $X$, the derivatives $\dot{X}$ must also be available or computed numerically prior to the model training. The SINDy algorithm basically performs a sparse regression to approximate $\dot{x} = f(x)$, where $x$ is one snapshot in time and one row of $X$. The method allows the user to define a library of possible candidate functions, which are used to create a feature matrix $\Theta(X)$. This matrix is used to identify sparse parameters of the coefficient matrix $\Xi$ such that $\dot{X} \approx \Theta(X)\Xi$ (see Figure 7). The selection of the candidate functions allows the user to integrate a certain amount of prior knowledge into the model. This method assumes that a dynamic model can be identified in the variables that are being observed. The chance of finding a sparse model in the high-dimensional sensor data of a CPS is thus rather small.
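The core regression can be sketched with a sequential thresholded least-squares loop in the spirit of the original SINDy paper; the toy one-dimensional system and the candidate library below are illustrative.

```python
import numpy as np

# Minimal sketch of the SINDy idea: build a candidate-function library
# Theta(X) and find a sparse coefficient matrix Xi with X_dot ≈ Theta(X) @ Xi
# via sequential thresholded least squares.
def stlsq(Theta, X_dot, threshold=0.1, n_iter=10):
    Xi = np.linalg.lstsq(Theta, X_dot, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold       # prune small coefficients
        Xi[small] = 0.0
        for k in range(X_dot.shape[1]):      # refit the remaining ones
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], X_dot[:, k], rcond=None)[0]
    return Xi

# Toy system x_dot = -2x + x^2 observed on a grid; with the library
# [1, x, x^2, x^3], STLSQ should recover exactly the two active terms.
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
x_dot = -2.0 * x + x**2
Theta = np.hstack([np.ones_like(x), x, x**2, x**3])
Xi = stlsq(Theta, x_dot)
print(np.round(Xi.ravel(), 3))  # approx [ 0. -2.  1.  0.]
```

The thresholding step is what produces the sparsity: terms whose coefficients fall below the threshold are removed from the model entirely, and the remaining coefficients are refit.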
Therefore, the Autoencoder-SINDy algorithm assumes that there is a non-linear coordinate transformation $z = \varphi(x)$ that allows the formulation of the system dynamics as a simple mathematical expression $\dot{z} = g(z)$. With the encoder $\varphi$ and the decoder $\psi$, the system dynamics in the observation space can be written as $\dot{x} = \nabla_z \psi(\varphi(x))\,\dot{z}$.
Equation (2) shows the loss function of the Autoencoder-SINDy setup:

$$\mathcal{L} = \|x - \psi(\varphi(x))\|_2^2 + \lambda_1 \left\|\dot{x} - \nabla_z \psi(\varphi(x))\left(\Theta(\varphi(x))\Xi\right)\right\|_2^2 + \lambda_2 \left\|\nabla_x \varphi(x)\,\dot{x} - \Theta(\varphi(x))\Xi\right\|_2^2 + \lambda_3 \|\Xi\|_1 \tag{2}$$

Here, $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights for the SINDy loss in $\dot{x}$, the SINDy loss in $\dot{z}$, and the regularization penalty respectively.
During training, the weights of the encoder $\varphi$ and the decoder $\psi$ as well as the coefficient matrix $\Xi$ are optimized with respect to the loss function above.
For this example, we also use the dynamical system in Equation (1) to simulate training data. We assume that all valves except valve 3 are open and that the system is in some random initial state $h_0$. To generate the complete dataset, we sample 1000 different initial conditions, run the simulation for 50 time steps each, and obtain a total dataset of 50,000 training samples. For the SINDy experiment, we assume that the levels of the tanks can be observed directly. For the Autoencoder-SINDy experiment, on the other hand, we assume that we observe the simulation in a higher-dimensional space. To generate this high-dimensional observation, we apply a polynomial transformation of degree 5 to the tank levels.
For the SINDy experiment, we used the Python implementation available on GitHub (de Silva et al., 2020). As expected, the quality of the results strongly depends on the choice of candidate functions. If we include the term $\mathrm{sgn}(h_i - h_j)\sqrt{|h_i - h_j|}$ in the candidate function library in addition to the typical candidates such as polynomials and trigonometric functions, we can very quickly identify Equation (1). However, if this rather specific function is not part of the candidate library, no sparse form of the dynamics can be identified.
The Autoencoder-SINDy model has identified the following equation for the dynamics in the latent space:
This system equation differs significantly from Equation (1). However, it is worth noting that the model has found a coordinate system in which only two out of three latent variables have a nonzero change over time. This makes sense because, for example, the differences of the tank levels $h_1 - h_2$ and $h_3 - h_2$ would theoretically suffice to describe the dynamics of the system in our experiment. To validate the quality of the model, we conduct the following experiment: First, we take some initial value $x_0$ and compute $z_0 = \varphi(x_0)$. We then solve the ODE in Equation (3) to obtain a time series in the latent space. Finally, we transform each value of the resulting time series back to the observation space using the decoder $\psi$. As Figure 8 shows, the resulting time series is very close to the original behavior of the system.
This experiment shows that it is possible to extract simple formulas for dynamical systems from the observations in the case of both observable and unobservable system states. In particular, the result of the SINDy experiment, namely the correct underlying differential system equation, increases the interpretability and scalability of predictive models compared to standard neural network approaches. The selection of the candidate functions allows the integration of prior knowledge into the model. However, both methods are very sensitive to the choice of candidate functions and the hyperparameter settings.
3.4 Solution 4: State identification
The state encodes the system's current behavior in a low-dimensional representation. Ideally, this representation has a topologically interpretable structure, i.e. states that are close to each other are more similar. Such a topological structure can be induced by self-organizing maps (SOMs) (Kohonen, 1990). A popular time series clustering method that builds on this approach is SOM-VAE (Fortuin et al., 2019). The model first uses an encoder to generate a representation $z_e$ from an observation $x$. In the latent space, the encoding is assigned to its nearest embedding vector $z_q$. The embedding vectors are randomly initialized and have a pre-defined number and relative position. These embedding vectors function as a representation of the underlying state. At its core, the model is an autoencoder; thus a decoder uses both the encoding and the embedding vector to reconstruct the input, resulting in $\hat{x}_e$ and $\hat{x}_q$ respectively. During training, the model adjusts both encodings and embedding vectors in order to minimize the reconstruction loss, which is the first term of the loss function in Equation (4). Furthermore, the encoding should be similar to its assigned embedding vector, which is handled by the commitment loss, the second term of the loss function. Lastly, the topological structure induced by the SOM has to be learned. In essence, the neighbors of the assigned embedding vector are pulled towards the encoding of the input. Crucially, the encoding does not receive gradient information from these other embeddings, which is denoted by the gradient stopping operator $\mathrm{sg}[\cdot]$:

$$\mathcal{L}_{\text{SOM-VAE}} = \mathcal{L}_{\text{recon}}(x, \hat{x}_q, \hat{x}_e) + \alpha\,\|z_e - z_q\|^2 + \beta \sum_{\tilde{z} \in N(z_q)} \|\mathrm{sg}[z_e] - \tilde{z}\|^2 \tag{4}$$
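The discrete assignment and the SOM neighbourhood update at the heart of this scheme can be sketched as follows; the grid size, learning rate, and data are illustrative assumptions (in NumPy there are no gradients, so the stop-gradient is implicit in the explicit update rule).

```python
import numpy as np

# Sketch: assign an encoding z_e to its nearest embedding vector on a 2D
# grid, then pull the grid neighbours of that embedding towards z_e.
def som_assign_and_update(z_e, embeddings, grid_shape, lr=0.1):
    dists = np.linalg.norm(embeddings - z_e, axis=1)
    q = int(np.argmin(dists))                  # index of nearest embedding z_q
    rows, cols = grid_shape
    r, c = divmod(q, cols)
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  # 4-neighbourhood
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            n = rr * cols + cc
            embeddings[n] += lr * (z_e - embeddings[n])  # pull neighbour closer
    return q, embeddings

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((6, 4))   # 6 states on a 2x3 grid, latent dim 4
z_e = rng.standard_normal(4)               # encoding of one observation
before = embeddings.copy()
q, embeddings = som_assign_and_update(z_e, embeddings, grid_shape=(2, 3))
print(q, bool(np.any(embeddings != before)))  # assigned state index, True
```

Repeating this update over many encodings arranges the embedding vectors so that neighbouring grid cells represent similar system states, which is what makes the state map interpretable.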
To simulate changing states in the three-tank system described in Equation (1), the values of the flow rates and the valve coefficients are changed periodically in a fixed sequence. The process can be seen in Figure 9, where the tanks are filled and mixed in stages until the whole fluid is released. In total, there are four different states: the tanks are filled, the flow is stopped with closed valves, the fluids of the tanks are mixed, and finally all valves are opened, which causes the tanks to empty. The first three states are repeated three times in a row, followed by the last state. This sequence is repeated 100 times, resulting in a total of 145,000 time steps. To generate the training dataset, 10,000 samples with a window size of 100 are drawn randomly from this time series. After training the model, we iterate over the test dataset to generate the predicted state at every time step. To ensure that the model does not use future values, at every step the model only has access to the past 100 time steps, thus simulating a live prediction. Using this dataset, we train a SOM-VAE with the encoder and decoder implemented as fully connected dense neural networks. The model is given the possibility to assign a total of six different states, which are ordered in a grid.
In the generated dataset, the system cycles through four different states, in which the tanks are filled, mixed and released repeatedly. While the underlying settings of the flow rates and the valve coefficients are the same in every respective state, the fill levels of the tanks accumulate, which means that the same state can occur at different fill levels. Ideally, a model should be able to detect the underlying state regardless of the differences in the fill levels.
The predicted states at each time step can be seen in Figure 9. The first thing to notice is that the predictions have a small time lag. That is expected, as the model only receives information about past values and needs some time to adapt to changes. The model detects the cycling through the filling, resting and mixing phases and switches between States 4 and 5 repeatedly. It can thus, to some extent, learn to generalize the system's behavior, as the individual fill levels differ between the phases. However, the model struggles to differentiate between the filling and the mixing phase, which might occur too quickly. The model was given the task of assigning six states, while the underlying system only has four. Some states thus do not seem to encode relevant information and are passed through quickly.
The model is able to detect changing states of the system in a fully unsupervised manner. Not only does it assign each sample to a state, it also learns an embedding that represents the system's current behavior and that could be used in downstream tasks. Furthermore, the transitions between states can be analyzed to gain insights into the temporal structure of the system, as demonstrated in the original SOM-VAE paper. While the underlying change of states has been detected, the predictions have a small time lag and quick transitions have not been detected properly. Furthermore, the number of possible states an individual model can learn is fixed beforehand, so some testing might be necessary if the true number of states cannot be estimated.
4 Discussion
This section discusses the usefulness of concept learning for CPS. We compare the methods demonstrated above on the basis of the following three core criteria: (i) applicability to CPS data, which describes the ability of the methods to handle real-life CPS data, including e.g. noisy measurements, discrete system state changes, and hybrid data forms (discrete and continuous), (ii) required prior knowledge, which describes the amount of prior knowledge needed to apply the method, and (iii) interpretability, the degree to which the results of the methods are interpretable.
The comparison is summarized in Table 1, where "++" indicates a high performance of the solution (e.g. highly interpretable results or no prior knowledge needed), while "+" and "−" indicate medium and low performance respectively.
Leaving aside the necessary prior knowledge, it can be said that Solutions 1, 2 and 4 can be applied to typical CPS datasets very well. These solutions have in common that the encoder can theoretically extract the representations from any complex time series dataset. In contrast, Solution 3 assumes a continuous dynamical system underlying the observations. However, in most CPS datasets there will be discrete sensor signals and externally triggered discrete state changes. With regard to the prior knowledge needed, Solution 2 stands out, as it requires labeled datasets which are hard to collect in a real-life CPS. Solutions 1, 3, and 4 only require prior knowledge in the form of hyperparameters such as the choice of candidate functions (Solution 3) or the number of states (Solution 4). The representations offered by Solutions 2 and 3 are most interpretable, as they either focus on physically meaningful latent variables (Solution 2) or symbolic expressions (Solution 3).
Another dimension along which the solutions can be compared is the degree to which their results can be used for different CPS use cases (see Table 2). For simplification, we generalize the multitude of CPS use cases into three areas: (i) system monitoring, which describes all tasks related to monitoring CPS e.g. anomaly detection, (ii) prognosis, which also includes simulation and predictive maintenance and (iii) diagnosis, including all applications that enable the root cause analysis related to system anomalies and failures.
All solutions have in common that they include a dimension reduction. Monitoring a few ideally meaningful variables to assess the overall system state is generally easier than visualizing a large number of sensor signals in the observation space. In addition, all solutions, with the exception of Solution 2, contain an autoencoder, which can be used for anomaly detection. With respect to prognosis-related use cases, the lower-dimensional representations of all solutions might be helpful when used as input features for downstream ML models. This is especially the case when the downstream tasks require labels and only a subset of the available data is labeled. Furthermore, the generative model of Solution 1 can be used for simulations by analyzing the effects of changes in the latent space on the overall system. Solution 3 stands out, however, because it can be used explicitly to predict the behavior of the system over time by means of the system dynamics equation. Finally, none of the solutions is readily suitable for diagnostic use cases. However, the information gained by applying the solutions might simplify the root cause analysis. For example, it may be relevant in which system state the anomaly occurred for the first time (Solution 4), or which of the meaningful latent variables shows an unexpected behavior (Solution 2).
5 Conclusion
In this paper, we have investigated to what extent deep RepL methods can be used in the field of CPS. We identified four different solutions for learning concepts from CPS data. Using a simple three-tank system as an example, we tested a selection of algorithms and discussed their advantages and disadvantages. We showed that, for example, VAEs and communicating agents can be used to extract the most important physical quantities from multidimensional CPS sensor data. In addition, we demonstrated how to identify discrete system states with a SOM-VAE and showed that the Autoencoder-SINDy method can identify a mathematical expression describing the system dynamics. Thereafter, we discussed the significance of each method in terms of its utility and applicability in real CPS.
By applying recent algorithms from RepL on CPS, we have been able to show shortcomings of the solutions. An interesting direction for future research would be to combine the methods for a better fit to the characteristics of CPS data. For example, learning a symbolic expression could be greatly enhanced if the latent variables encode interpretable physical quantities, as demonstrated by the communicating agents. Additionally, by filtering for discrete state shifts, the complexity of the dynamic system can be greatly reduced. This paper has mainly focused on learning concepts with interpretable representations. In contrast, huge ML models that are trained on lots of data learn useful (but not interpretable) representations, which can be used to transfer knowledge to subsequent ML models. Likewise, transferring knowledge of physical concepts could improve ML models on CPS data. We believe this paper has shown the potential of concept learning and can motivate the development of algorithms that focus on the unique challenges CPS pose.
- Yoshua Bengio, Yann Lecun, and Geoffrey Hinton. Deep learning for AI. Communications of the ACM, 64(7):58–65, 2021.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Pranab K Muhuri, Amit K Shukla, and Ajith Abraham. Industry 4.0: A bibliometric analysis and detailed overview. Eng. Appl. Artif. Intell., 78:218–235, February 2019.
- Khanh T P Nguyen and Kamal Medjaher. A new dynamic predictive maintenance framework using deep learning for failure prognostics. Reliab. Eng. Syst. Saf., 188:251–262, August 2019.
- Oliver Niggemann and Christian Frey. Data-driven anomaly detection in cyber-physical production systems. at - Automatisierungstechnik, 63(10):821–832, October 2015.
- Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V Chawla. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. AAAI, 33:1409–1416, July 2019.
- Kaja Balzereit, Alexander Diedrich, Jonas Ginster, Stefan Windmann, and Oliver Niggemann. An ensemble of benchmarks for the evaluation of AI methods for fault handling in CPPS. In IEEE International Conference on Industrial Informatics (INDIN), pages 1–6, 2021.
- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.
- H Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol., 24(6):417–441, September 1933.
- J B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, March 1964.
- Geoffrey Hinton and Sam T Roweis. Stochastic neighbor embedding. In NIPS, volume 15, pages 833–840, 2002.
- Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. J. Mach. Learn. Res., 9(86):2579–2605, 2008.
- Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint, February 2018.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint, December 2013.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z Ghahramani, M Welling, C Cortes, N Lawrence, and K Q Weinberger, editors, NIPS, volume 27. Curran Associates, Inc., 2014.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
- Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Jennifer Dy and Andreas Krause, editors, ICML, volume 80 of Proceedings of Machine Learning Research, pages 2649–2658. PMLR, 2018.
- Ricky T Q Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, NIPS, volume 31. Curran Associates, Inc., 2018.
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In D Lee, M Sugiyama, U Luxburg, I Guyon, and R Garnett, editors, NIPS, volume 29. Curran Associates, Inc., 2016.
- Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. AAAI, 33(01):5885–5892, July 2019.
- Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455, 2009.
- Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, volume 30, 2017.
- Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. Deep self-organization: Interpretable discrete representation learning on time series. In ICLR, 2019.
- Alexander von Birgelen, Davide Buratti, Jens Mager, and Oliver Niggemann. Self-organizing maps for anomaly localization and predictive maintenance in cyber-physical production systems. Procedia CIRP, 72:480–485, 2018.
- Nemanja Hranisavljevic, Alexander Maier, and Oliver Niggemann. Discretization of hybrid CPPS data into timed automaton using restricted Boltzmann machines. Eng. Appl. Artif. Intell., 95:103826, October 2020.
- Hendrik Poulsen Nautrup, Tony Metger, Raban Iten, Sofiene Jerbi, Lea M Trenkwalder, Henrik Wilming, Hans J Briegel, and Renato Renner. Operationally meaningful representations of physical systems in neural networks. arXiv preprint, January 2020.
- Raban Iten, Tony Metger, Henrik Wilming, Lídia Del Rio, and Renato Renner. Discovering physical concepts with neural networks. Phys. Rev. Lett., 124(1):010508, January 2020.
- Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, April 2009.
- Silviu-Marian Udrescu, Andrew Tan, Jiahai Feng, Orisvaldo Neto, Tailin Wu, and Max Tegmark. AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. arXiv preprint, June 2020.
- Silviu-Marian Udrescu and Max Tegmark. AI Feynman: A physics-inspired method for symbolic regression. Sci. Adv., 6(16):eaay2631, April 2020.
- Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. U. S. A., 113(15):3932–3937, 2016.
- Kathleen Champion, Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Data-driven discovery of coordinates and governing equations. Proc. Natl. Acad. Sci. U. S. A., 116(45):22445–22451, November 2019.
- Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun., 9(1):4950, November 2018.
- Steven L Brunton, Marko Budišić, Eurika Kaiser, and J Nathan Kutz. Modern Koopman theory for dynamical systems. arXiv preprint, February 2021.
- Marek Kubalčík and Vladimír Bobál. Predictive control of three-tank-system utilizing both state-space and input-output models. Sign, 2(1):1, 2016.
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. arXiv preprint, June 2014.
- Brian M de Silva, Kathleen Champion, Markus Quade, Jean-Christophe Loiseau, J Nathan Kutz, and Steven L Brunton. PySINDy: A Python package for the sparse identification of nonlinear dynamics from data. arXiv preprint, April 2020.
- T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.