The historical development of machine learning algorithms on time series data has followed a clear trend from initial simplicity to state-driven complexity. For instance, limitations in the Hidden Markov Model (see6]
or Long Short-Term Memory model. In this parametric learning framework, success in modeling temporal data has been largely dictated by the model’s ability to store appropriate latent states in memory, and to correctly transition between states according to underlying dynamics of the data.
Among gradient-based models, interpreting these high-dimensional latent states, and relating them to observations, remains a difficult area of research. Moreover, modeling dependencies within complex temporal data, such as speech audio and financial pricing movements, requires a large number of observations, a requirement that often limits the application of such models in real-life scenarios.
1.1 The LESS Algorithm
In this paper, we introduce an interpretable, unsupervised, non-parametric approach to time series segmentation called LESS: Laplacian Events Signal Segmentation. LESS is motivated by multi-scale geometric ideas, and its core computations are simple linear algebraic operations and convolution. The algorithm featurizes temporal data in a template-matching procedure using wavelets. The resulting wavelet coefficient representation is a trajectory in state-space. We interpret time steps of this representation as nodes of an underlying graph, whose graph structure is informed by events implicit in the original signal. A partitioning of this graph via its Laplacian embedding results in event segmentation of the signal.
LESS takes inspiration from the intuitive insight that naturally occurring temporal data have few meaningful event types, or motifs, and thus converting signals to event sequences has the potential to drastically reduce transmission and storage requirements. We aim to derive the simplest unsupervised technique that mirrors existing spectral clustering applications in image and graph data domains.
After surveying related work in Section 2, we give a detailed description of our proposed LESS technique in Section 3. Then Section 4 gives an empirical analysis of the robustness-to-noise of LESS, as well as a careful analysis of its computational complexity.
A first application of LESS is shown in Section 5. Using the Free Spoken Digits Dataset, we visualize the temporal dynamics of spoken audio with its segmentation into meaningful events, such as strong vs weak enunciations of ‘rho’ in ‘zero’, and show a clear contrast between representation trajectories in wavelet coefficient space belonging to different spoken digits. Furthermore, we show that LESS outperforms Dynamic Time Warping and SAX despite summarizing time series observations in a far more parsimonious fashion.
Finally, although we do not pursue it formally in this paper, Section 6 outlines ways that LESS has strong potential to fit within upstream fusion pipelines of heterogeneous-modality time series. Hence, we argue that LESS will make a key contribution to unsupervised classification, visualization, and fusion tasks, especially in scenarios where training data is limited and/or communication between computational nodes is constrained.
This paper has been cleared for public release, as Case Number 88ABW-2019-4842, 07 Oct 2019. Both authors were partially supported by the Air Force Research Laboratory under contract AFRL-RIKD FA8750-18-C-0009. We are grateful to Drs. Peter Zulch and Jeffrey Hudack (AFRL) for motivating discussions, to Christopher J. Tralie for discussions on the scattering transform, to Nathan Borggren and Kenneth Stewart for pre-processing and discussion of the dataset in Section 4, and to Tessa Johnson for assistance with a clean version of LESS implementation.
2 Related Work
We briefly survey related work, and situate LESS within the context it defines.
Connections to wavelet theory
Matching pursuit projects signal data onto a sparse approximation via a dictionary of wavelets. By computing the approximation error using this dictionary, the algorithm greedily selects a new wavelet that maximally reduces this error and adds it to the dictionary. Matching pursuit can thus encode a signal as a sparse approximation using few wavelets. Our algorithm similarly encodes an input signal via a wavelet dictionary. However, we further reduce this wavelet representation by computing its implicit motifs, summarizing it as a sequence of categorical values.
Similarly, wavelet scattering also utilizes a wavelet dictionary to derive sparse encodings of signals. In comparison to wavelet representations, scatter representations have the additional properties of being invariant to signal translation and of capturing more complex frequency structure in the signal via a convolution network. We opt for wavelet scattering as the signal featurization procedure due to these additional properties. From the perspective of wavelet scattering, the LESS representation is in essence a temporal segmentation over the scatter coefficients.
Connections to graph theory
Tootooni et al. propose to monitor process drifts in multivariate time series data using spectral graph-based topological invariants. Like LESS, the proposed technique considers multidimensional sensor signals and interprets an underlying state graph of the signal. By maintaining a window on an incoming stream of multidimensional data, vector time elements within the window form nodes of the graph. The authors observe that changes in the Fiedler number, a graph topological invariant computed from the graph’s Laplacian matrix, are informative about the process state. The empirical analysis demonstrates effective fault detection in process monitoring applications. LESS differs mainly by applying graph spectral techniques to wavelet-domain representations. In this way, complex frequency dynamics within the signal are accounted for in the state graph, and LESS generates an event sequence representation of the signal in its entirety, instead of a sequence of graph topological statistics emitted over the course of a sliding window.
Lacasa et al. propose the visibility algorithm, a process which converts a time series into a visibility graph. Visibility graphs preserve the periodicity structure of the time series as graph regularity, while stochasticity in the time series is expressed as a random graph. The visibility graph is invariant to translation and scaling of the time series, so that affine transformations of the time series lead to the same visibility graph. In comparison, our approach also generates translation-invariant representations due to wavelet scattering. Periodicity within a time series is recognized as motifs by LESS, and as recurring tokens in the resulting event sequence representation.
Connections to time series modeling
Fused LASSO regression for multidimensional time series segmentation combines breakpoint detection with computing breakpoint significance. After determining breakpoints implicit in the time series, the technique relies on clustering within each detected segment to estimate breakpoint significance. Like this technique, LESS seeks to address the multidimensional time series segmentation problem. However, we segment according to a set of motifs implicitly determined by the time series, instead of identifying a new segment for each local change in trend. This allows the desired number of motifs to be preset, and makes similar events easy to identify, as they belong to the same motif.
Sprintz is a time series compression technique that leverages a linear forecasting algorithm trained online on a data stream. By combining Delta Coding, a stateless, error-based forecast heuristic common in the compression literature, with an autoregressive model, the authors are able to predict an upcoming delta as a rescaled version of the most recent delta, leading to a more expressive version of Delta Coding.
This forecasting component of Sprintz, named ‘Fast Integer REgression’ (FIRE), is further supported by bit packing and Huffman coding to efficiently handle blocks of error values. In comparison, the objective of our paper is to find structure within time series data by unsupervised learning, and then to aggressively reduce the data representation to a sequence of categorical tokens corresponding to events. While our run-time scales worse because LESS views the time series in its entirety, LESS representations have the flexibility to be of predetermined length, depending on the user’s preferred granularity of signal segmentation. Our representations also convey frequency-temporal structures within the time series.
SAX creates a symbolic representation of the time series via Piecewise Aggregate Approximation (PAA). By dividing the time series data into equal frames, the PAA representation models the data with a linear combination of box basis functions. By binning the PAA coefficients according to the coefficient histogram, and setting the global frame size, the lengths of SAX symbolic sequences are easily controlled. LESS controls the length of its output representation through the number of motifs hypothesized to be within the data. By decomposing the input signal via wavelet scattering, LESS featurization presents continuous signal structure in the frequency-temporal domain rather than discrete frames in the temporal domain. Moreover, the resulting wavelet coefficients are segmented by considering their relative locations in wavelet coefficient space, rather than by binning according to a global coefficient histogram, as SAX does with PAA coefficients.
3 The LESS Algorithm
The high-level steps in the LESS algorithm are as follows (Figure 1 shows a flow diagram). First, wavelet scattering is applied to a raw time series, resulting in a wavelet representation. Then a weighted graph is computed from the wavelet representation, where the vertices of the graph are the time steps of the representation. The weighted edges of the graph are computed from the self-similarity matrix of the wavelet representation followed by a varying-bandwidth kernel. Finally, spectral clustering is applied to extract an event sequence, the final output of LESS. This section describes each of these technical steps in detail. Further information on the parameters involved can be found in the Appendix.
3.1 Wavelet Representation
Wavelet scattering creates a representation of a time series by composition of wavelet transforms. The main utility provided here is the reduction of uninformative variability in the signal, specifically translation in the temporal domain and noise components of the data. In the application of event segmentation, it is sufficient for wavelet scattering to capture the low-frequency structure of the signal. These properties are explained in detail in the rest of this section. Throughout this paper, we refer to the scatter representation as ‘the wavelet representation’, even though that terminology describes a broader family of wavelet-derived objects.
Wavelet scattering enables frequency selection.
The wavelet transform of a signal is the set of coefficients computed by convolving the signal with a dictionary of wavelets, each indexed by a frequency: convolving a wavelet with the signal emits the coefficients corresponding to that wavelet’s frequency. Wavelet scattering re-applies the wavelet transform to the resulting coefficients using the same dictionary, under the condition that an identical wavelet cannot be applied to the coefficients more than once. Each further application of the wavelet transform therefore generates an additional set of convolution coefficients. We find that a small wavelet dictionary capturing the low-frequency components of the signal is sufficient, leading to fast scatter computations and representations that are robust to instrumental noise (Figure 7). In practice, the wavelet representation is most effectively generated by the 5 to 20 wavelets of lowest frequency.
Wavelet scattering is invariant to translations of the signal.
That is, a signal and any time-translated copy of it have the same wavelet representation. As the majority of wavelet techniques are covariant with translation, shifting the signal in time alters their wavelet representations. As we seek to construct an underlying graph capturing the dynamics of the signal, translation invariance is an important property for leveraging spectral graph theory techniques. Its relevance emerges when considering the instability of Laplacian eigenfunctions under a changing graph. Although the signal is the same, a shift in its temporal domain directly replaces vertices of the graph, to a degree that depends on the magnitude of the shift relative to the length of the data. See below for a detailed discussion.
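Wavelet scattering linearizes deformations in signal space.
For a displacement field (the deformation), the wavelet representation is Lipschitz continuous to deformations if there exists a constant such that, for every signal and every deformation, the distance between the wavelet representations of the deformed and original signals is bounded by that constant times the norm of the signal times the size of the deformation.
As an example, for a small deformation in the form of additive noise, the change in the wavelet representation is closely approximated by a bounded linear operator of the deformation. Like translation invariance, this linearization of deformations in signal space further reduces undesirable variability in the construction of the graph.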
Every signal is mapped by wavelet scattering to its wavelet representation. In scatter coefficient space, undesirable variability in the space of signals is removed while the frequency-temporal structure of signals is preserved.
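The frequency-selection and subsampling ideas above can be illustrated with a deliberately simplified, first-order-only sketch. It is not the scattering transform used in the paper: the Gabor filters, the box low-pass filter, and the names `gabor_wavelet`, `first_order_features`, `center_freqs` and `J` are all assumptions introduced here for illustration.

```python
import numpy as np

def gabor_wavelet(center_freq, width, length):
    """Complex Gabor (Morlet-like) filter; center_freq is in cycles per sample."""
    t = np.arange(length) - length // 2
    envelope = np.exp(-0.5 * (t / width) ** 2)
    # Normalise so that the gain at the centre frequency is close to one.
    return envelope * np.exp(2j * np.pi * center_freq * t) / envelope.sum()

def first_order_features(x, center_freqs, J):
    """Modulus of each band-pass output, averaged and subsampled every 2**J samples.

    Returns an array of shape (len(x) // 2**J, len(center_freqs)):
    one row per time step, one column per wavelet in the small dictionary.
    """
    step = 2 ** J
    lowpass = np.ones(step) / step  # crude box filter standing in for the low-pass filter
    columns = []
    for f in center_freqs:
        psi = gabor_wavelet(f, width=step / 2, length=4 * step + 1)
        modulus = np.abs(np.convolve(x, psi, mode="same"))
        smooth = np.convolve(modulus, lowpass, mode="same")
        columns.append(smooth[step // 2::step][: len(x) // step])
    return np.stack(columns, axis=1)

# Toy signal whose frequency content changes halfway through (an 'event').
n = 1024
t = np.arange(n)
x = np.where(t < n // 2, np.sin(2 * np.pi * 0.05 * t), np.sin(2 * np.pi * 0.2 * t))
features = first_order_features(x, center_freqs=[0.05, 0.2], J=4)
```

In this toy case the rows of `features` sit near one point of coefficient space during the first half of the signal and near another during the second half, which is the kind of movement the trajectory graph is meant to capture.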
3.2 Wavelet Trajectory Graph
Since the wavelet representation may be iterated over by its temporal index, it is also a trajectory in wavelet coefficient space. A change in the frequency structure of the signal, from here on recognized as an ‘event’, is reflected in the wavelet representation by movement within wavelet coefficient space. Moreover, recurring events return the trajectory to the same region due to their similar frequency characteristics.
By segmenting the trajectory into its prevalent stationary and transitory patterns through time, we automate the identification of event motifs. See Figure 3 for an illustration, and Figure 4 for trajectory plots belonging to real data. The role of spectral clustering is to identify prevalent time series motifs. By segmenting the dynamics of the trajectory according to motif durations, a sequence of events in the signal is revealed.
The algorithm constructs an underlying graph of the wavelet representation; the time elements of the representation form the vertex set of the graph. The weighted edges of the graph are determined by the affinity matrix between time elements. The pairwise affinity between two time elements is a kernel function of their Euclidean distance in wavelet coefficient space, whose dimension is set by the wavelet dictionary size. The kernel is controlled by a global scaling parameter and by an adaptive kernel size specific to the pair of time elements.
An adaptive kernel size, in contrast to a global kernel size, mitigates the effect of densely or sparsely sampled local neighborhoods. Following standard practice, we make the kernel size adaptive to local neighborhood distances. We compute the average distance between each time element and its nearest neighbors to approximate the local sampling density, and the adaptive kernel size for a pair of time elements is computed from the average neighborhood distances of the two elements.
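As a concrete illustration of an adaptive-bandwidth affinity, the sketch below uses one common self-tuning form, in which the squared distance between two time elements is divided by the product of their average neighborhood distances. The exact formula used by LESS is not reproduced here, so the kernel shape and the names `adaptive_affinity`, `n_neighbors` and `eps` are assumptions.

```python
import numpy as np

def adaptive_affinity(W, n_neighbors=5, eps=1.0):
    """Affinity matrix for a wavelet representation W of shape (T, D).

    Assumed form: A[i, j] = exp(-d(i, j)**2 / (eps * sigma[i] * sigma[j])),
    where sigma[i] is the mean distance from row i to its nearest neighbours.
    """
    diff = W[:, None, :] - W[None, :, :]          # vectorised pairwise subtraction
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # (T, T) Euclidean distances
    # Column 0 of each sorted row is the zero self-distance, so skip it.
    sigma = np.sort(dist, axis=1)[:, 1:n_neighbors + 1].mean(axis=1)
    sigma = np.maximum(sigma, 1e-12)              # guard against duplicate points
    return np.exp(-dist ** 2 / (eps * np.outer(sigma, sigma)))

# Two well-separated groups of time elements, standing in for two stationary events.
rng = np.random.default_rng(0)
W = np.vstack([rng.normal(0.0, 0.1, (10, 3)), rng.normal(5.0, 0.1, (10, 3))])
A = adaptive_affinity(W, n_neighbors=3)
```

Time elements from the same group receive affinities far larger than elements from different groups, which is what produces the clique-like substructures in the graph.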
The interpretation of the wavelet representation as a graph is quite literal. Observe that signal events inform the structure of the graph: stationary trends in the signal are easily captured as clique-like substructures in the graph, since stationary samples of the signal are all proximal in wavelet coefficient space. Two-state periodic behavior in the signal may be reflected as a cycle connecting two cliques.
3.3 Event Identification
To exploit this insight on connectivity and community structure in the graph, we leverage spectral clustering to automate the partitioning of the graph, where each subset of time elements corresponds to an event type in the data. At a high level, this is accomplished by embedding the graph vertices using the eigenfunctions of the graph Laplacian. Specifically, we compute the normalized Laplacian from the adjacency matrix and the degree matrix of the graph.
By applying eigendecomposition to the normalized Laplacian, we extract its eigenfunctions. The Laplacian embedding, formed from the eigenfunctions corresponding to the smallest Laplacian eigenvalues, contains information regarding the stable cuts of the graph. See von Luxburg’s excellent survey for a detailed treatment.
The embedding is a ‘tall’ matrix whose rows denote embedding coordinates for individual vertices. The eigenfunctions corresponding to larger eigenvalues yield more unstable, and therefore noisier, vertex partitions. For most naturally occurring time series, the first three eigenfunctions of the graph Laplacian are sufficient. From here on, ‘the Laplacian embedding’ refers to this low-rank embedding.
Within the graph, each connected component is encoded by a Laplacian eigenvector within the eigenspace of eigenvalue 0 (in other words, the geometric multiplicity of eigenvalue 0 matches the number of connected components), where such an eigenvector is an indicator vector whose entry for a vertex is nonzero when the vertex belongs to that component. The Laplacian eigenvectors corresponding to non-zero eigenvalues partition connected components within the graph. The eigenvector of smallest non-zero eigenvalue, the Fiedler vector, gives a partition from the most stable cut of the graph: a cut partitioning a connected component that acts most like a bottleneck of the component, i.e. in the context of the normalized Laplacian, a cut of maximal flow whose removal leads to two components of similar volume.
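The relationship between connected components and the zero eigenvalue can be checked numerically. The sketch below assumes the symmetric normalized Laplacian (identity minus the degree-normalized adjacency matrix), which is one standard choice and may differ from the normalization used by LESS; `laplacian_embedding` and `n_components` are illustrative names.

```python
import numpy as np

def laplacian_embedding(A, n_components=3):
    """Smallest eigenpairs of the symmetric normalized Laplacian of affinity A.

    Assumes every vertex has positive degree. Returns (eigenvalues, embedding),
    where the embedding is a 'tall' matrix with one row per vertex.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return eigvals[:n_components], eigvecs[:, :n_components]

# A graph made of two disconnected cliques of three vertices each.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
vals, U = laplacian_embedding(A, n_components=3)
```

Here eigenvalue 0 appears twice, once per clique, and vertices of the same clique share identical coordinates in the first two embedding columns.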
Finally, spectral clustering applies k-means clustering to the Laplacian embedding, assigning cluster memberships that partition the graph. The number of cluster centroids is a LESS parameter dictating the number of motifs to be considered while segmenting the signal. For larger numbers of centroids, higher time-resolution events may be observed, though overfitting in the form of short, spurious events is likely to occur.
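For completeness, a minimal k-means (Lloyd's algorithm) on embedding rows might look like the following. The deterministic farthest-point initialisation is an assumption made so that the example is reproducible; a library implementation would normally be used instead, and the function name `kmeans` is illustrative.

```python
import numpy as np

def kmeans(U, k, n_iter=50):
    """Assign each row of U (one row per vertex) to one of k clusters."""
    # Farthest-point initialisation (an illustrative, deterministic choice).
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest centre for every row.
        d = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

U = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.0, 0.1]])
labels = kmeans(U, k=3)
```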
By applying spectral clustering to the wavelet trajectory graph, we partition the wavelet representation, and by inversely mapping cluster assignments to the original temporal domain of the signal, LESS segments the signal into a sequence of categorical tokens, its event sequence.
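The inverse mapping from cluster labels to the original time axis can be sketched as follows. The sketch assumes that each wavelet time step covers a fixed number of samples (`hop`, a hypothetical parameter) and groups consecutive identical labels into a single event, as is done for the spoken-digit experiments.

```python
from itertools import groupby

def labels_to_events(labels, hop):
    """Turn per-time-step cluster labels into (motif, start_sample, end_sample) events."""
    events = []
    position = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive steps with this label
        events.append((label, position * hop, (position + length) * hop))
        position += length
    return events

events = labels_to_events([0, 0, 1, 1, 1, 0, 2, 2], hop=16)
tokens = [motif for motif, _, _ in events]
```

The list `tokens` is the compact event sequence, while `events` keeps the sample ranges needed to annotate the original signal.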
4 Properties and Computational Characteristics
4.1 Moving Average and Noise
This subsection describes two LESS properties: applications to signals with constant moving average and to noisy signals.
For signals with constant moving average through time, i.e. without trends, LESS event sequences tend to reflect local changes in the Root Mean Square envelope and energy of the signal. To visualize this, LESS is applied to the recently released ESCAPE dataset, a collection of live scenes where various vehicles are observed by sensors of different modalities. For illustration purposes here, we focus only on the audio modality and on four particular single-vehicle runs with vehicles of distinct types. In Figure 6, four 1-minute audio recordings, belonging to single vehicles of different types, are concatenated and then annotated by LESS. In the event sequences below each recording, common colors across observations denote events belonging to the same motif.
LESS representations remain robust in the presence of significant noise. By omitting wavelets corresponding to high frequency indices in wavelet scattering, the resulting wavelet representation does not account for high frequency structure in the original signal. Thus, while decreasing the number of necessary convolution operations, the algorithm regularizes the representation.
We illustrate this desirable property via an observation from the MIT-BIH electrocardiogram data. As heart muscles contract in a cyclical pattern, the cardiac cycle of P wave, QRS complex, ST segment and T wave is consistently identified in the event sequence by the light blue, green, orange, and brown events respectively. Gaussian noise at two levels of variance was then added to the ECG signal. As seen in Figure 7, LESS identifies all noticeable patterns within the original signal accurately; moreover, it continues to output similar annotations under increasing noise.
4.2 Computational Complexity and Scaling with Ambient Dimension
We analyze the computational load of LESS in this subsection. Note that LESS first applies wavelet scattering and then spectral clustering. In the case of a multivariate signal, LESS computations scale linearly with the size of the ambient dimension. Altogether, we claim that the computational complexity of LESS is determined by three quantities: the number of ambient dimensions, the length of the time series, and the number of time steps in the wavelet representation, the last of which enters through a cubic term. We argue below that, in general practice, the number of time steps in the wavelet representation is much smaller than the length of the time series, and thus the cubic term in the complexity analysis should not be that daunting.
To see this claim, we note that the cost of the scattering computation on a one-dimensional signal is determined by the length of the signal and the depth of the scatter network, and that a particular depth works best in practice. Given a multivariate signal, we may naively apply the scatter computation once per ambient dimension to generate coefficients for each dimension of the input, showing that the complexity of wavelet scattering on multivariate time series grows linearly with the number of ambient dimensions. In practice, we note that modern numeric packages (e.g., kymatio) for wavelet techniques leverage Graphics Processing Units (GPUs) to accelerate convolution operations.
The remainder of the computation depends on neither the number of ambient dimensions nor the length of the input data. During wavelet scattering, the extraction of wavelet coefficients is preceded by convolution with a low-pass filter, after which the signal is subsampled at intervals set by a scaling parameter chosen in advance. The number of time steps in the wavelet representation therefore decreases exponentially, relative to the length of the signal, as the scaling parameter increases. For most reasonable values of the scaling parameter, the wavelet representation has far fewer time steps than the signal has samples.
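A small arithmetic sketch makes the scaling concrete. It assumes that the subsampling interval is two raised to the scaling parameter, which is the usual convention for scattering transforms but is not stated explicitly here; `representation_length`, `affinity_entries` and `J` are illustrative names.

```python
def representation_length(n_samples, J):
    """Number of time steps left after subsampling every 2**J samples (assumed convention)."""
    return n_samples // 2 ** J

def affinity_entries(n_samples, J):
    """Number of entries in the dense pairwise affinity matrix over those time steps."""
    steps = representation_length(n_samples, J)
    return steps * steps

# One second of 8 kHz audio at a few hypothetical scaling parameters.
lengths = {J: representation_length(8000, J) for J in (6, 8, 10)}
```

Each unit increase in the scaling parameter halves the number of time steps and so quarters the size of the affinity matrix.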
Given a multidimensional time series, wavelet scattering is applied once per ambient dimension, each application using the same wavelet dictionary. What results is a wavelet representation whose coefficients are concatenated dimension-wise. To compute the affinity matrix for spectral graph analysis, the coefficient vectors of the wavelet representation are subtracted pairwise, and the distance matrix is computed with vectorized subtraction. After transferring high-dimensional coefficient information into connectivity relationships in the underlying graph, the spectral clustering portion of the algorithm becomes agnostic to the ambient dimension of the input time series.
Lastly, spectral clustering’s computational bottleneck arises in acquiring a low-rank approximation to the affinity matrix. Using the Nyström method, there is an inherent two-step orthogonalization procedure, whose cost contributes the final term of the analysis. In total, LESS operates with computational complexity as claimed.
This section outlines a first proof-of-principle application for LESS. We focus on distinguishing spoken digits from one another, using audio recordings, in an entirely unsupervised manner. That is, we use LESS to derive a distance between spoken audio observations, and then observe that clusters formed by this distance tend to correspond to digit type. The performance is especially encouraging when compared with two other common methods for deriving distances between time series data, Dynamic Time Warping and SAX.
The Free Spoken Digits Dataset (FSDD) consists of spoken digit recordings in wav files at 8 kHz. FSDD contains 2,000 recordings from 4 speakers, each speaker saying digits 0 to 9 fifty times per digit.
Representations derived from the proposed technique remain informative towards classification tasks. To illustrate this, we compute within-class and between-class distances between event sequences. As an example, given a choice of parameters and a number of possible event types, LESS maps a signal to an event sequence.
Such an event sequence contains categorical tokens corresponding to changes in the frequency structure of the signal, and consecutive tokens of the same motif are replaced with a single token. The Levenshtein distance is a standard method to compare any two token strings. Figure 8 shows the Levenshtein distance matrix computed between LESS representations of 30 short audio strings, 10 each of digit types ‘0’, ‘1’, and ‘3’. The rows of the matrix are ordered by digit type. For our choice of hyperparameters, the event sequence lengths range from four events (or tokens) to eight, ensuring transmission parsimony. Within each class, our technique generated event sequences that are, in expectation, 2 edits away from one another. As seen in the block-diagonal structure of the matrix, between-class distances are higher.
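The distance computation can be sketched with the standard dynamic-programming form of the Levenshtein distance, applied here to event sequences stored as lists of motif labels. The example sequences below are made up for illustration and are not taken from the experiments.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,         # delete x
                            curr[j - 1] + 1,     # insert y
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]

def distance_matrix(sequences):
    """Symmetric matrix of pairwise Levenshtein distances."""
    return [[levenshtein(s, t) for t in sequences] for s in sequences]

# Hypothetical event sequences: the first two resemble each other, the third does not.
sequences = [[0, 1, 2, 1], [0, 2, 1], [3, 4, 3, 4, 3]]
D = distance_matrix(sequences)
```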
We compare to two other methods of computing distances between signal snippets, Dynamic Time Warping (DTW) and SAX. In brief, DTW produces an optimal match between the time indices of two time series observations that is robust to non-linear temporal distortions, in addition to temporal translation. Given a pair of signals, the DTW distance refers to the minimal cost required to align the time indices of the two signals, or equivalently the matching cost of their optimal match. Note that DTW can be computed on pairs of time series in any metric space; here we use it both in signal space and in wavelet coefficient space. SAX, on the other hand, creates a symbolic representation of the time series via Piecewise Aggregate Approximation (PAA). The output of SAX is very similar to that of LESS in form, namely a sequence of categorical tokens. Two such token sequences can then be compared using the Levenshtein distance as above.
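For reference, the classic quadratic-time DTW recursion is sketched below. This is the textbook algorithm rather than the accelerated variant used in the experiments, and the default pointwise cost (absolute difference) is an assumption; any metric on the sample space could be passed as `dist`.

```python
import math

def dtw_distance(x, y, dist=lambda a, b: abs(a - b)):
    """Minimal cumulative cost of aligning the time indices of x and y."""
    n, m = len(x), len(y)
    # D[i][j] is the best cost of aligning the first i samples of x with the first j of y.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(D[i - 1][j],      # x advances alone
                       D[i][j - 1],      # y advances alone
                       D[i - 1][j - 1])  # both advance
            D[i][j] = dist(x[i - 1], y[j - 1]) + step
    return D[n][m]

stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # same shape, different timing
shifted = dtw_distance([1, 2, 3], [2, 3, 4])       # different values
```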
Figures 9 and 10 show the results of the comparison experiments. Each column of each figure corresponds to a single experiment, where the task is to distinguish between spoken instances of two digit types. The top row in both figures uses the LESS-Levenshtein distance. The middle and bottom rows of Figure 9 use DTW computed on the scattering space time series and the raw time series, respectively. Qualitatively, the advantage of the LESS technique over the others at this task is clear. Furthermore, there are two other practical advantages: 1) in wall-clock time, the approximate DTW algorithm, FastDTW, took an order of magnitude longer than LESS to make comparisons; 2) the advantage of LESS over DTW in a communications-constrained environment should be clear: before DTW is computed on two signals arising from different sources, the entire signal must be transmitted, while LESS requires transmission of only the much shorter event sequence.
Figure 10 shows the same experiments in a comparison with SAX. The SAX alphabet-size parameter is the number of token types available for labeling fixed-width windows. We set the alphabet size to 10 following the recommendation of the SAX authors. We set representation lengths to be 100 (middle row) and 1000 (bottom row). For representation lengths of 1000, within-class structure begins to emerge in the distance matrices. Again, the practical advantages of LESS in terms of computational speed and transmission in communications-constrained environments hold.
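To make the two SAX parameters concrete, the following sketch follows the usual SAX recipe: z-normalise the series, average it over equal-width frames (PAA), and bin the frame means using breakpoints that are equiprobable under a standard normal distribution. It is a simplified illustration, not the implementation used in the experiments; `sax`, `n_frames` and `alphabet_size` are illustrative names.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax(series, n_frames, alphabet_size):
    """Return a list of n_frames integer symbols in range(alphabet_size).

    Assumes a non-constant series with at least n_frames samples.
    """
    mu, sd = mean(series), pstdev(series)
    z = [(v - mu) / sd for v in series]
    n = len(z)
    # PAA: mean of each of n_frames (nearly) equal-width frames.
    bounds = [round(i * n / n_frames) for i in range(n_frames + 1)]
    paa = [mean(z[bounds[i]:bounds[i + 1]]) for i in range(n_frames)]
    # Breakpoints splitting the standard normal into equiprobable regions.
    cuts = [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    return [bisect_right(cuts, c) for c in paa]

symbols = sax(list(range(8)), n_frames=4, alphabet_size=2)
```

The representation length is fixed by `n_frames` regardless of the signal content, in contrast with LESS, whose sequence length follows the number of detected events.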
This paper presented LESS: Laplacian Events Signal Segmentation, a graph spectral representation of arbitrary-dimensional time series data. Despite its unsupervised nature, LESS is shown to be highly performant at a digit classification task, and especially so when judged in data and/or communication constrained environments. Further work will proceed along several fronts:
6.1 Memory Issues
Despite the nice computational complexity analysis above, the current implementation of LESS presents serious memory management issues on conventional hardware. An application of LESS requires the concatenation of the multiple signals under consideration into one array, due to spectral clustering’s memory-less nature. This is impractical for large batches: 1) the eigendecomposition term in the complexity analysis dominates the computation and 2) the resulting representation may not be stored in memory for the entire dataset. Hence, on a desktop computer with conventional hardware, LESS may process 5-10 minutes of audio sampled at 48,000 Hz within 30 minutes.
6.2 Towards LESS as a Fusion Technique
The experiments above only display the benefits of LESS as a compression technique for tokenizing high-dimensional time series before transmission down a stingy channel. Beyond this paper, we are developing methods to use LESS within upstream fusion pipelines. The most immediate approach is to note that LESS involves the computation of a weighted graph based on a distance matrix (see the corresponding block in Figure 1). When faced with several distinct time series of arbitrary dimensionality, one could simply run LESS up to this block, producing one weighted graph per time series. Any number of distance-graph-based upstream fusion techniques (e.g., similarity network fusion or joint manifold learning) could then be used to produce a fused weighted graph. The final step of LESS could then be applied to produce the fused event sequence. Experiments must be done to show that the resulting event sequence is indeed more informative, at tasks such as the ones outlined above, than stovepiped event sequences.
In a communication-constrained environment, the transmission of entire weighted graphs might be too expensive. Hence further work needs to be done on this front, either by: 1) pursuing sparse representations of the weighted graph, or; 2) creating an event-sequence-level fusion algorithm.
-  Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, Joan Bruna, Vincent Lostanlen, Matthew J. Hirn, Edouard Oyallon, Sixin Zhang, Carmine-Emanuele Cella, and Michael Eickenberg. Kymatio: Scattering transforms in Python. CoRR, abs/1812.11214, 2018.
-  Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, 1994.
-  Davis Blalock, Samuel Madden, and John Guttag. Sprintz: Time series compression for the internet of things. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 2(3):93:1–93:23, September 2018.
-  J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, Aug 2013.
-  T Tony Cai and Lie Wang. Orthogonal matching pursuit for sparse signal recovery with noise. Institute of Electrical and Electronics Engineers, 2011.
-  Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):6085, 2018.
-  Mark A. Davenport, Chinmay Hegde, Marco F. Duarte, and Richard G. Baraniuk. High dimensional data fusion via joint manifold learning. In AAAI Fall Symposium: Manifold Learning and Its Applications, 2010.
-  Yufei Han and Maurizio Filippone. Mini-batch spectral clustering. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3888–3895. IEEE, 2017.
-  Weiyu Huang, Thomas A. W. Bolton, John D. Medaglia, Danielle S. Bassett, Alejandro Ribeiro, and Dimitri Van De Ville. A graph signal processing view on functional brain imaging, 2017.
-  Brijnesh J. Jain. Generalized gradient learning on time series. Machine Learning, 100(2):587–608, Sep 2015.
-  Lucas Lacasa, Bartolo Luque, Fernando Ballesteros, Jordi Luque, and Juan Carlos Nuño. From time series to complex networks: The visibility graph. Proceedings of the National Academy of Sciences, 105(13):4972–4975, 2008.
-  Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. In Soviet physics doklady, volume 10, page 707, 1966.
-  Mu Li, Xiao-Chen Lian, James T Kwok, and Bao-Liang Lu. Time and space efficient spectral clustering via column sampling. In CVPR 2011, pages 2297–2304. IEEE, 2011.
-  Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi. Experiencing sax: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2):107–144, Oct 2007.
-  Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
-  Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
-  R.G. Mark, P.S. Schluter, G.B. Moody, P.H. Devlin, and D. Chernoff. An annotated ECG database for evaluating arrhythmia detectors. IEEE Transactions on Biomedical Engineering, 29, 1982.
-  Nooshin Omranian, Bernd Mueller-Roeber, and Zoran Nikoloski. Segmentation of biological multivariate time-series data. Scientific Reports, 2015.
-  Stan Salvador and Philip Chan. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007.
-  M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7):1396–1403, Nov 2007.
-  Dan Shen, Erik Blasch, Peter Zulch, Marcello Distasio, Ruixin Niu, Jingyang Lu, Zhonghai Wang, and Genshe Chen. A joint manifold leaning-based framework for heterogeneous upstream data fusion. Journal of Algorithms & Computational Technology, 12(4):311–332, 2018.
-  M. S. Tootooni, P. K. Rao, C. Chou, and Z. J. Kong. A spectral graph theoretic approach for monitoring multivariate time series data from complex dynamical processes. IEEE Transactions on Automation Science and Engineering, 15(1):127–144, Jan 2018.
-  Christopher J. Tralie, Paul Bendich, and John Harer. Multi-scale geometric summaries for heterogeneous sensor fusion. In Proceedings of the 2019 IEEE Aerospace Conference.
-  Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
-  Bo Wang, Aziz M Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature methods, 11(3):333, 2014.
-  Donghui Yan, Ling Huang, and Michael I Jordan. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 907–916. ACM, 2009.
-  Walter Zucchini, Iain L MacDonald, and Roland Langrock. Hidden Markov models for time series: an introduction using R. Chapman and Hall/CRC, 2017.
-  Peter Zulch, Marcello DiStasio, Todd Cushman, Brian Wilson, Ben Hart, and Erik Blasch. Escape data collection for multi-modal data fusion research. In Proceedings of the 2019 IEEE Aerospace Conference.
This appendix gives more details on the parameters involved in LESS. All parameters are inherited from wavelet scattering and spectral clustering.
This paper uses the wavelet scattering implementation from the Kymatio package, and the parameter notation follows the Kymatio documentation. The scaling parameter $J$ dilates Morlet wavelets by a factor of up to $2^J$. As a wavelet’s sinusoid component dilates, convolution with the signal leads to decorrelated scales. This reveals the frequency structure of the signal at a higher resolution, while decreasing temporal resolution. In practice, a moderate $J$ is suitable for most signal data, while short signals require a smaller $J$ because they lack adequate temporal support for the wavelets. $Q$ is the number of first-order wavelets per octave. For most applications, exploring $Q$ incrementally in fixed multiples seems efficient. In practice, selecting the wavelets indexed by the lowest frequencies is sufficient.
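To make the roles of the two scattering parameters concrete, the following minimal Python sketch lists the dilation factors of an idealized dyadic filter bank with Q wavelets per octave spread over J octaves. It is a simplification for illustration only and does not reproduce Kymatio's actual filter construction; the helper names `wavelet_scales` and `min_signal_length` are hypothetical, and the requirement of at least 2^J samples is a rule of thumb assumed here.

```python
def wavelet_scales(J, Q):
    """Dilation factors 2**(j / Q) for j = 0, ..., J*Q - 1.

    Idealized bank: Q wavelets per octave over J octaves (an assumption,
    not Kymatio's exact filter bank).
    """
    return [2.0 ** (j / Q) for j in range(J * Q)]


def min_signal_length(J):
    """Rule of thumb (assumed): the widest wavelet spans about 2**J samples."""
    return 2 ** J


scales = wavelet_scales(J=3, Q=2)
print(len(scales), "wavelets, largest dilation", round(max(scales), 3))
print("signal should have at least", min_signal_length(3), "samples")
```

Raising J in this sketch adds octaves of coarser wavelets (and raises the minimum usable signal length), while raising Q packs more wavelets into each octave.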
The kernel width, the spectral clustering parameter found in affinity matrix computations, controls the notion of global proximity. As the kernel width increases, the kernel radii surrounding points expand, resulting in larger edge weights in the graph. For kernel sizes computed from a normalized pairwise-distance matrix, a single fixed value is optimal.
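The effect of the kernel width on edge weights can be sketched as follows. This is a minimal illustration that assumes the standard Gaussian similarity exp(-d^2 / (2 sigma^2)) applied to pairwise distances normalized by their maximum; the name `sigma`, the function `gaussian_affinity`, and the choice of normalization are assumptions made for this example rather than details confirmed by the text.

```python
import math


def gaussian_affinity(points, sigma):
    """Affinity matrix from a Gaussian kernel on normalized pairwise distances."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist) or 1.0  # avoid dividing by zero
    return [
        [
            0.0 if i == j  # no self-loops
            else math.exp(-((dist[i][j] / d_max) ** 2) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
narrow = gaussian_affinity(points, sigma=0.5)
wide = gaussian_affinity(points, sigma=1.0)
print(narrow[0][2], wide[0][2])  # the wider kernel gives the larger weight
```

With a larger kernel width, distant points receive non-negligible weights, so the graph becomes more densely connected.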
The Laplacian eigenvectors encode the vertices into the Laplacian embedding. For all LESS experiments shown, only a small part of the spectrum has been used, although membership information contained in eigenvectors of larger eigenvalues may prove beneficial.
Lastly, $k$, the number of motifs, is the number of clusters in $k$-means clustering applied to the embedding. In simple signal data that rarely exhibit novel events, such as FSDD, a small $k$ is sufficient. On the other hand, to transform entire audio scenes, a larger $k$ is required.
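The embedding and clustering steps described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: it uses the unnormalized Laplacian L = D - W, keeps a caller-chosen number of eigenvectors after the constant one, and runs a plain Lloyd k-means with a deterministic farthest-point initialization. The function names, the toy affinity matrix, and these specific choices are assumptions for the example and are not taken from the LESS implementation.

```python
import numpy as np


def laplacian_embedding(W, dim):
    """Embed vertices with the `dim` eigenvectors that follow the constant one."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W        # unnormalized Laplacian (assumed)
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues come back in ascending order
    return eigvecs[:, 1:dim + 1]


def kmeans(X, k, n_iter=50):
    """Lloyd iterations with a deterministic farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Toy graph: two tightly connected triangles joined by one weak edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.01

labels = kmeans(laplacian_embedding(W, dim=1), k=2)
print(labels)  # one motif label per vertex (time step)
```

Here k plays the role of the number of motifs: each vertex, and hence each time step of the representation, receives one of k labels.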
The event sequence may be interpreted as an annotation of each time step of the wavelet representation. While running various classification tasks, we find that the event sequence remains performant after discarding consecutive tokens of the same motif and only noting when the event type has changed.
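Discarding consecutive repeats of the same motif amounts to keeping one token per run. A minimal Python sketch, assuming the event sequence is a list of integer motif labels (the function name `collapse_runs` and the sample sequence are hypothetical):

```python
from itertools import groupby


def collapse_runs(events):
    """Keep one token per run of identical consecutive motifs."""
    return [motif for motif, _ in groupby(events)]


events = [0, 0, 0, 2, 2, 1, 1, 1, 1, 0]
collapsed = collapse_runs(events)
print(collapsed)  # [0, 2, 1, 0]
```

The collapsed sequence records only the order in which event types change, so run lengths (durations) are no longer available.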