Attention-based Neural Bag-of-Features Learning for Sequence Data

In this paper, we propose 2D-Attention (2DA), a generic attention formulation for sequence data, which acts as a complementary computation block that can detect and focus on relevant sources of information for the given learning objective. The proposed attention module is incorporated into the recently proposed Neural Bag-of-Features (NBoF) model to enhance its learning capacity. Since 2DA acts as a plug-in layer, injecting it into different computation stages of the NBoF model results in different 2DA-NBoF architectures, each of which possesses a unique interpretation. We conducted extensive experiments in financial forecasting, audio analysis, and medical diagnosis problems to benchmark the proposed formulations against existing methods, including the widely used Gated Recurrent Units. Our empirical analysis shows that the proposed attention formulations can not only improve the performance of NBoF models but also make them resilient to noisy data.

I Introduction

Learning problems in many fields involve sequence data, such as time-series forecasting [32, 18], audio analysis [8, 28], or natural language processing [3, 2], all of which have been extensively studied. In many application scenarios, the observed sequence is highly non-stationary and noisy, which makes the task of modeling the underlying generating process more difficult. For example, in sound source separation, in which the objective is to recover different unknown sources by filtering the observed mixtures, environmental noise is inherent and often complicates the separation process. Several mathematical techniques have been proposed to model the underlying data and noise distributions or to extract hand-crafted features capturing certain desirable properties. In financial time-series analysis, representative examples include autoregressive (AR) and moving average (MA) [27] features, which were later extended with a differencing step to eliminate non-stationarity, known as the autoregressive integrated moving average (ARIMA) model [29]. Gaussian processes and Hidden Markov Models were popular mathematical frameworks in audio analysis. To ensure mathematical and computational tractability, these classical models are often formulated under many assumptions, which can be sensitive to initialization and misaligned with real-world conditions, thus limiting their practical use.

During the last decade, thanks to developments in stochastic optimization techniques and computing hardware, as well as the declining costs of data acquisition and storage, a data-driven approach based on deep neural networks and stochastic optimization has largely replaced the classical approach based on hand-designed models and convex optimization. Nowadays, many of the state-of-the-art solutions for learning with sequence data are built on neural networks. Notably, a class of neural network architectures called Recurrent Neural Networks (RNN), which are specifically designed to process variable-length sequences and to capture sequential patterns, has become the main workhorse in different application domains. Another dedicated neural formulation for sequence data is the family of bilinear structures [32, 30, 31], which were proposed to separately capture the dependencies along the temporal and spatial dimensions of financial time-series. Even existing neural architectures that were originally proposed for visual inputs, such as the Convolutional Neural Network (CNN) [14] and Neural Bag-of-Features (NBoF) [22], have shown competitive performance on sequence data compared with dedicated statistical models [34, 23]. The advantage of neural formulations over statistical learning and traditional hand-crafted features lies in the fact that fewer assumptions are made, and data is leveraged to automatically identify and extract task-relevant features in an end-to-end fashion.

The Bag-of-Features (BoF) model [13] was originally proposed to build histogram representations from images. Later, it was shown that BoF could be successfully applied to extract high-level representations for other data modalities, such as video and audio [11, 25, 9, 10]. Learning BoF representations consists of two steps: dictionary learning, and feature quantization and encoding. In the dictionary learning step, each object is first represented by a set of low-level features, which could be, for example, a collection of local descriptors like SIFT [15] for an image or word-level vector encodings for a sentence. These features are then used to generate a compact dictionary (codebook) comprising the most representative features, also known as codewords. In the second step, the histogram representation of each object is extracted by quantizing its low-level features using the codebook.

Recently, Neural Bag-of-Features (NBoF) [22], a neural network generalization of the BoF model, has been proposed. Similar to its predecessor, NBoF can generate a fixed-size histogram vector from variable-size inputs. This neural network generalization works as a feature extraction layer, which can be combined and optimized jointly with other neural network layers to tackle both unsupervised and supervised objectives via stochastic optimization. Since the dictionary learning step in NBoF is updated in conjunction with the other layers towards the end goal of optimizing an objective function, histogram vectors synthesized by NBoF are more representative than those produced by BoF in different learning scenarios such as visual recognition, information retrieval, and financial forecasting [21, 22, 23].

While the NBoF model works well in different learning problems, its current formulations still possess some limitations. In the aggregation step, all of the quantized features are simply averaged to form the histogram vector. For sequence data, this implies that the model only allows equal contributions of the quantized features coming from different time steps when forming the output representation. Similarly, the quantization results produced by each codeword are considered equally important for every sequence in the training set. These properties prevent the dictionary learning, quantization, and encoding process from fully taking advantage of the data-driven approach.

To incorporate a higher degree of flexibility into the NBoF model, a weighting mechanism operating on the sequence level is desirable. That is, for each individual sequence, the model should have the flexibility to perform a weighted sum of the quantized outputs in the aggregation step, with the coefficients adaptively changing with respect to the input sequence, or to select or discard irrelevant codewords given the input sequence. In the neural network literature, this is often achieved by attention mechanisms [33, 17, 31]. The idea of attention is inspired by the phenomenon observed in the human visual cortex that visual stimuli from multiple objects actively compete for neural encoding.

Although various attention mechanisms have been proposed for existing neural network architectures such as CNNs [33, 12], LSTMs [17, 24], or bilinear structures [31], there is not yet any such formulation for the NBoF model when learning with sequence data. To obtain a generic attention mechanism that can be applied in a plug-and-play manner, in this work we propose 2D-Attention (2DA), a neural network module that promotes competition among the different rows or columns of an input matrix and (soft-)selects only those that win the competition. We then demonstrate that by injecting 2DA into NBoF, we can overcome the limitations mentioned previously. The contributions of our work can be summarized as follows:

  • We propose a new type of attention formulation for matrix data, dubbed 2DA. The proposed layer acts as a complementary computation block that is capable of identifying relevant sources of information and performing selective masking on the given input matrix.

  • We incorporate 2DA into different stages of the Neural Bag-of-Features (NBoF) model, creating various 2DA-NBoF extensions that enhance the feature quantization or histogram accumulation step of the NBoF model. Extensive experiments were conducted in three different application domains: financial forecasting, audio analysis, and medical diagnosis, which demonstrate the effectiveness of our attention module in improving the NBoF model. In the case of noisy input, a variant of 2DA-NBoF shows resilience to noise by filtering out the noisy sources of information before the feature quantization step.

The remainder of the paper is organized as follows: in Section II, we review the NBoF model and its extensions for time-series data, as well as previously proposed attention mechanisms in the neural network literature. In Section III, we first present the proposed attention module 2DA and its interpretation; several extensions of the NBoF model that incorporate 2DA are then presented. In Section IV, we provide details of our experiment protocols and quantitative analysis. Section V concludes our work with possible future research directions.

II Related Work

The NBoF model [22] consists of two components: a quantization layer and an accumulation layer. Each quantization neuron in the quantization layer acts as a codeword that can be updated via the backpropagation algorithm. In the original formulation [22], a Radial Basis Function (RBF) layer was used for feature quantization. Recently, it has been shown that the hyperbolic kernel is also effective for the feature quantization step [20]. Here we describe the original formulation with the RBF layer.

Let $K$ be the number of neurons (codewords) in the RBF layer and $\mathbf{v}_k \in \mathbb{R}^D$ be the $k$-th codeword. In addition, the shape of the Gaussian function modeled by each neuron can be adjusted via the parameter $\mathbf{w}_k \in \mathbb{R}^D$. Let us denote the input sequence of features as $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_T] \in \mathbb{R}^{D \times T}$ with $\mathbf{x}_t \in \mathbb{R}^D$. The output of the $k$-th RBF neuron given the input feature $\mathbf{x}_t$ is the following:

$[\phi(\mathbf{x}_t)]_k = \exp\left(-\left\|(\mathbf{x}_t - \mathbf{v}_k) \odot \mathbf{w}_k\right\|_2\right), \quad k = 1, \dots, K \qquad (1)$

where $\odot$ denotes the element-wise product and $\mathbf{w}_k$ is the learnable weight vector that enables the shape of the Gaussian kernel associated with the $k$-th RBF neuron to change.

As the sequence $\mathbf{X}$ goes through the quantization layer, each feature $\mathbf{x}_t$ is quantized as $\phi(\mathbf{x}_t) \in \mathbb{R}^K$, producing a sequence of quantized features $\mathbf{\Phi} = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_T)] \in \mathbb{R}^{K \times T}$. The accumulation layer aggregates the information in $\mathbf{\Phi}$ by calculating the averaged quantized feature:

$\mathbf{h} = \frac{1}{T} \sum_{t=1}^{T} \phi(\mathbf{x}_t) \qquad (2)$
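
For concreteness, the quantization and accumulation steps of Eqs. (1) and (2) can be sketched in a few lines of NumPy. The function below is only an illustration under the formulation reconstructed above; the variable names and shapes are our own and do not reproduce the authors' implementation.

    import numpy as np

    def nbof_histogram(X, V, W):
        """Minimal NBoF sketch: X is a (D, T) input sequence, V is a (K, D) matrix
        of codewords, W is a (K, D) matrix of per-codeword scaling weights.
        Returns the (K,) histogram vector of Eq. (2)."""
        D, T = X.shape
        K = V.shape[0]
        Phi = np.zeros((K, T))
        for t in range(T):
            # Eq. (1): weighted RBF response of every codeword to x_t
            diff = (X[:, t][None, :] - V) * W            # (K, D)
            Phi[:, t] = np.exp(-np.linalg.norm(diff, axis=1))
        return Phi.mean(axis=1)                          # Eq. (2): average over time steps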

There have been a few extensions of the NBoF model for sequence data. For example, a Temporal Neural Bag-of-Features (TNBoF) model with different specialized codebooks was proposed in [19] to capture both short-term and long-term temporal information in financial time-series. In [20], the authors derived a logistic formulation of the NBoF model using the hyperbolic kernel instead of the RBF kernel for the quantization step and proposed an adaptive scaling mechanism, which showed significant improvements in the training stability and performance of NBoF networks.

While the attention mechanism was biologically inspired by the study of visual processing, it has also inspired and advanced several works in sequence data analysis, notably in sequence-to-sequence learning tasks. The first attention formulation applied to sequence data was proposed in [2] for tackling machine translation tasks. In this formulation, the authors proposed to construct the context vectors of a sequence-to-sequence recurrent neural network model by selectively combining several hidden states, rather than using only the last hidden state as the context vector. The selection coefficients, also known as attention weights, are computed adaptively based on the given input sequence and updated jointly with the other parameters during stochastic optimization.

The successful application of the attention mechanism in machine translation tasks has led to the emergence of other attention formulations, which are designed to capture different types of salient information in sequence data. For example, in [4], the authors proposed a formulation that can detect pseudo-periods in certain types of time-series, such as energy consumption or meteorology data. To predict the future stock index, a dual-stage attention mechanism was proposed in [24] for RNNs to actively select relevant exogenous series and temporal instances. Similarly, to highlight and focus on important temporal events in the Limit Order Book, the authors in [31] proposed a method to calculate attention masks for bilinear networks. Although an attention formulation has been proposed for the convolutional NBoF model in [12] to estimate the true color of images captured by different devices, this formulation only works with image data. To the best of our knowledge, there has been no attention formulation for the NBoF model to tackle sequence data.

III Proposed Methods

In this section, we will first present 2D-Attention (2DA), our proposed attention calculation for matrix data. Then, we will show how 2DA can be used to address different limitations of the NBoF model as described in Section I. Throughout the paper, we denote scalar values by either lower-case or upper-case characters (e.g., $x$, $N$), vectors by lower-case bold-face characters (e.g., $\mathbf{x}$), matrices by upper-case bold-face characters (e.g., $\mathbf{X}$), and mathematical functions by calligraphic characters (e.g., $\mathcal{A}$). In addition, we use $\mathbf{X}_{ij}$ to denote the element at position $(i, j)$ in a matrix $\mathbf{X}$.

Fig. 1: Illustration of the proposed attention formulation (2DA) and different attention-based NBoF models

III-A 2D-Attention

A matrix $\mathbf{X} \in \mathbb{R}^{D \times N}$ is a second-order tensor with two modes, where $D$ and $N$ are the dimensions of the first and second mode, respectively. The matrix representation provides a natural way to represent a signal with two different sources of information. For example, a multivariate time-series is represented by a matrix with one mode representing the temporal dimension and the other mode representing the different sources that generate the individual series.

The general idea of the attention mechanism is to highlight important elements in the data while discarding irrelevant ones. For data represented as a matrix $\mathbf{X}$, rather than considering each element of $\mathbf{X}$ individually, we would like to actively select certain columns or rows of $\mathbf{X}$ while discarding the others. This is because the columns or rows of $\mathbf{X}$ usually form coherent sub-groups of the data. For example, discarding some temporal events or some individual series in a multivariate series corresponds to removing some rows or columns, depending on the orientation of the matrix.

To adaptively determine and focus on different columns or rows of a matrix, we propose 2D-Attention (2DA), with the functional form denoted by $\mathcal{A}(\cdot)$. This function takes a matrix $\mathbf{X} \in \mathbb{R}^{D \times N}$ as the input and returns $\tilde{\mathbf{X}} \in \mathbb{R}^{D \times N}$ as the output. That is:

$\tilde{\mathbf{X}} = \mathcal{A}(\mathbf{X}) \qquad (3)$

$\tilde{\mathbf{X}}$ can be considered as a filtered version of $\mathbf{X}$, in which the columns of $\mathbf{X}$ that are irrelevant to the learning problem are zeroed out. Here we should note that $\mathcal{A}$ performs adaptive attention with respect to the columns of $\mathbf{X}$. To focus on different rows of $\mathbf{X}$, we can simply apply $\mathcal{A}$ to the transpose of $\mathbf{X}$.

The selection or rejection of the columns of $\mathbf{X}$ is conducted via element-wise matrix multiplication as follows:

$\tilde{\mathbf{X}} = \lambda (\mathbf{X} \odot \mathbf{M}) + (1 - \lambda) \mathbf{X} \qquad (4)$

where $\mathbf{M} \in \mathbb{R}^{D \times N}$ denotes the attention mask with values in the range $(0, 1)$. Each column in $\mathbf{M}$ encodes the importance of the corresponding column in $\mathbf{X}$. That is, the attention mask contains values close to $1$ for those columns in $\mathbf{X}$ that contain important information for the downstream learning task, and vice versa. In Eq. (4), the parameter $\lambda \in [0, 1]$, which is jointly optimized with the other parameters, allows flexible control of the attention mechanism: when $\mathbf{X}$ contains redundant or noisy information in its columns, the effect of the attention mask is enabled by pushing $\lambda$ close to $1$; on the other hand, when every column of $\mathbf{X}$ is necessary, i.e., there is no need for attention, pushing $\lambda$ close to $0$ disables the effect of $\mathbf{M}$. The necessity of attention is thus determined automatically by optimizing $\lambda$ with respect to a given problem.

To calculate the attention mask $\mathbf{M}$, the proposed 2DA method learns to measure the relative importance between the columns of $\mathbf{X}$ via a specially designed weight matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$: all elements of $\mathbf{W}$ are learnable, i.e., they are updated during stochastic optimization, except the diagonal elements, which are kept fixed. The attention mask is calculated as follows:

$\mathbf{M} = \varsigma(\mathbf{E}), \quad \mathbf{E} = \mathbf{X}\mathbf{W} \qquad (5)$

where $\varsigma(\cdot)$ denotes the soft-max function applied to every row of $\mathbf{E}$. That is, every element of $\mathbf{M}$ is non-negative, and each row of $\mathbf{M}$ sums up to $1$. Similar to other attention formulations [24, 2, 33], we use soft-max normalization to promote competition between the different columns of $\mathbf{X}$.

As mentioned previously, the weight matrix $\mathbf{W}$ is used to measure the relative importance between the columns of $\mathbf{X}$, which is encoded in $\mathbf{E}$ and thus in $\mathbf{M}$. In order to see this, let us denote by $\mathbf{x}_i$ and $\mathbf{e}_i$ the $i$-th columns of $\mathbf{X}$ and $\mathbf{E}$, respectively. Since $\mathbf{E} = \mathbf{X}\mathbf{W}$, the $i$-th column of $\mathbf{E}$, i.e., $\mathbf{e}_i$, is calculated as a weighted combination of the columns of $\mathbf{X}$, with the weight of the $i$-th column always equal to the fixed diagonal value, since the diagonal elements of $\mathbf{W}$ are not updated. In this way, element $\mathbf{W}_{ji}$ (in $\mathbf{W}$) encodes the relative importance of $\mathbf{x}_j$ (in $\mathbf{X}$) with respect to $\mathbf{x}_i$, for $j \neq i$.
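
As a minimal sketch of the computation in Eqs. (4) and (5), the PyTorch module below applies 2DA over the columns of a (batch, D, N) input. The module name, the initialization, the value used for the fixed diagonal of W, and the clamping of λ to [0, 1] are our own assumptions for illustration; the formulation only specifies that the diagonal of W is not learned.

    import torch
    import torch.nn as nn

    class TwoDAttention(nn.Module):
        """Sketch of the 2DA block for inputs of shape (batch, D, N); attends over columns."""

        def __init__(self, n_cols):
            super().__init__()
            # Off-diagonal entries of W are learnable; the diagonal stays fixed
            # (the value 1/n_cols used here is an assumption, not taken from the paper).
            self.W = nn.Parameter(torch.randn(n_cols, n_cols) * 0.01)
            self.register_buffer("off_diag", 1.0 - torch.eye(n_cols))
            self.register_buffer("fixed_diag", torch.eye(n_cols) / n_cols)
            # lambda of Eq. (4); clamping it to [0, 1] is an implementation choice.
            self.lam = nn.Parameter(torch.tensor(0.5))

        def forward(self, X):                        # X: (batch, D, N)
            W = self.W * self.off_diag + self.fixed_diag
            E = torch.matmul(X, W)                   # relative-importance scores, Eq. (5)
            M = torch.softmax(E, dim=-1)             # row-wise soft-max attention mask
            lam = torch.clamp(self.lam, 0.0, 1.0)
            return lam * (X * M) + (1.0 - lam) * X   # Eq. (4)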

III-B Attention-based Neural Bag-of-Features

In this subsection, we will show how the proposed attention module 2DA can be used to address different limitations of the NBoF model described in Section I.

Codeword Attention: in the NBoF model, the quantization results produced by each quantization neuron (codeword) are considered equally important for every input sequence. This property prevents the feature quantization step from fully taking advantage of the data-driven approach. In order to overcome this limitation, the proposed 2DA block can be applied to the quantized features to highlight or discard the outputs of certain quantization neurons. By doing so, the NBoF model is explicitly encouraged to learn a subset of specialized codewords for a given input pattern.

Particularly, given the quantized features $\mathbf{\Phi} \in \mathbb{R}^{K \times T}$ as described in Section II, we propose to apply attention to the rows of $\mathbf{\Phi}$, because the first mode of $\mathbf{\Phi}$, whose dimension is $K$, indexes the quantization neurons or codewords. Since 2DA operates on the columns of its input matrix, the attention-based quantized features are calculated as follows:

$\tilde{\mathbf{\Phi}} = \mathcal{A}(\mathbf{\Phi}^\top)^\top \qquad (6)$

where $\cdot^\top$ denotes the transpose.

Temporal Attention: another limitation of the NBoF model lies in the aggregation step. In order to produce a fixed-length representation of the input sequence, the aggregation step in the NBoF model simply computes the mean of quantized features along the temporal mode. In this way, the NBoF model only allows equal contributions of all quantized features, disregarding the temporal information. In fact, the idea of giving different weights to different time instances has been adopted in previous works under different formulations [31, 24]. Using our proposed 2DA formulation, it is straightforward to enable the NBoF model to attend to salient temporal information as follows:

$\tilde{\mathbf{\Phi}} = \mathcal{A}(\mathbf{\Phi}) \qquad (7)$

Since each column of $\mathbf{\Phi}$ contains the quantized features of one time step, to obtain the temporal attention-based features we simply apply $\mathcal{A}$ to $\mathbf{\Phi}$ as in Eq. (7). $\tilde{\mathbf{\Phi}}$ is then averaged along the second (temporal) dimension to produce the fixed-length representation of the input sequence. Although we still perform averaging in the aggregation step, the fixed-length representation is no longer the plain average of the quantized features but a weighted average, because each time instance (column) in $\mathbf{\Phi}$ has been scaled by a different factor via the attention mechanism.

Input Attention: noisy data is an inherent problem in many real-world applications. Noise might be introduced during the data acquisition process, such as ambient noise in audio signals or power-line interference and motion artifacts in Electrocardiogram signals. In other scenarios, noise is inherent in the problem formulation, since the relevance between the input sources and the targets might be unclear. For example, in stock prediction, it is intuitive to use related stocks' data, e.g., those coming from the same market sector, as the input to construct forecasting models, although some of them might be irrelevant to the movement of the target stock.

The proposed attention mechanism can also be used to filter out potentially noisy series in a multivariate series as follows:

$\tilde{\mathbf{X}} = \mathcal{A}(\mathbf{X}^\top)^\top \qquad (8)$

where $\mathbf{X} \in \mathbb{R}^{D \times T}$ denotes an input sequence of $T$ steps of the NBoF model, as specified in Section II. Since we would like to apply attention over the individual series (rows of $\mathbf{X}$), $\mathcal{A}$ is applied to the transpose of $\mathbf{X}$.

The proposed attention variants of the NBoF model are illustrated in Figure 1.
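
Assuming the TwoDAttention sketch above is in scope, the three variants of Eqs. (6)-(8) differ only in which mode of the data is exposed as the column dimension. The hypothetical snippet below illustrates this with arbitrary tensor sizes.

    import torch

    batch, D, T, K = 8, 40, 50, 64                 # illustrative sizes only
    X = torch.randn(batch, D, T)                   # multivariate input sequence
    Phi = torch.randn(batch, K, T)                 # stand-in for the quantized features

    att_codeword = TwoDAttention(n_cols=K)         # attends over codewords, Eq. (6)
    att_temporal = TwoDAttention(n_cols=T)         # attends over time steps, Eq. (7)
    att_input = TwoDAttention(n_cols=D)            # attends over input series, Eq. (8)

    Phi_ca = att_codeword(Phi.transpose(1, 2)).transpose(1, 2)   # codeword attention
    Phi_ta = att_temporal(Phi)                                   # temporal attention
    X_ia = att_input(X.transpose(1, 2)).transpose(1, 2)          # input attention
    h = Phi_ta.mean(dim=2)                                       # weighted-average histogram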

IV Experiments

In this section, we provide detailed descriptions and results of our empirical analysis, which demonstrate the advantages of the attention-based NBoF models proposed in Section III. Experiments were conducted on different types of sequence data, namely financial time-series in a stock movement prediction problem, Electrocardiogram (ECG) and Phonocardiogram (PCG) signals in heart anomaly detection problems, and audio recordings in music genre recognition and acoustic scene classification problems.

The experiments were conducted with the recently proposed logistic formulation of the NBoF model [20], i.e., the hyperbolic kernel was used in the quantization layer. In addition, we also experimented with the temporal variant of the NBoF model proposed in [19], which uses a long-term and a short-term codebook. This variant is denoted as TNBoF. The codeword attention, temporal attention, and input attention variants applied to the NBoF model are denoted as NBoF-CA, NBoF-TA, and NBoF-IA, respectively. The corresponding attention variants of the TNBoF model are denoted as TNBoF-CA, TNBoF-TA, and TNBoF-IA. In addition to the NBoF and TNBoF models serving as baselines, we also evaluated RNN models using Gated Recurrent Units (GRU) [3].

IV-A Financial Forecasting Experiments

Although extensively studied over the last decades, financial forecasting remains one of the most challenging time-series prediction tasks [26]. This is due to the complex dynamics of financial markets, which make the observed data highly non-stationary and noisy. For this reason, we selected the stock movement prediction problem on the FI2010 dataset [18] as a representative problem in time-series forecasting. FI2010 is the largest publicly available Limit Order Book (LOB) dataset, containing several million order events. The limit orders came from Finnish stocks traded on the Helsinki Exchange (operated by NASDAQ Nordic) over a number of business days. At each time instance, the dataset provides information (the prices and volumes) of the top levels of the book, leading to a fixed-size vector representation.

The FI2010 dataset is used to investigate the problem of predicting the mid-price movement over a horizon of future order events. The mid-price at a given time instance is the average of the best buy and best sell prices. This quantity is a virtual price, since no trade can happen at this particular price at the given time instance. The movement of the mid-price (stationary, increasing, decreasing) reflects the dynamics of the LOB and the market, and thus plays an important role in financial analysis. The dataset provides the movement labels for several prediction horizons. Details regarding the FI2010 dataset and the LOB can be found in [18].
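
In symbols, denoting the best ask and best bid prices at time $t$ by $p_a^{(1)}(t)$ and $p_b^{(1)}(t)$, the mid-price is

$p_{\text{mid}}(t) = \frac{p_a^{(1)}(t) + p_b^{(1)}(t)}{2}.$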

We followed the same experimental setup proposed in [31], which uses the earlier days of the dataset for training the models and the remaining days for testing. Due to the imbalanced nature of the problem, we report the F1 score averaged over the movement classes as the main performance metric, similar to prior experiments [32, 31]. Detailed information about the training hyper-parameters and the network architectures is provided in the Appendix.

Models without conv with conv
Prediction Horizon

GRU [3]
NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)
Prediction Horizon
GRU [3]
NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)
Prediction Horizon
GRU [3]
NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)

TABLE I: Results on the FI2010 dataset. The second column shows performances (averaged F1 in %) of all models without using any convolution layer as preprocessing layers. The third column shows the corresponding performances (average F1 in %) when additional convolution layers were used as preprocessing layers. The best results in each column are highlighted in bold-face

The experiment results for FI2010 are shown in Table I. In the second column of Table I, we list the performances of all models without using any convolution layers as preprocessing layers. That is, the results in the second column of Table I are produced by architectures consisting of only the layer of interest (such as GRU, NBoF, and so on), plus the fully connected layers for generating predictions. In this setting, the GRU models outperform all variants of the NBoF model. This is expected since the NBoF model, by construction, is not designed to capture local features and long-term dependencies in the input sequence. We can easily observe that this limitation is partially overcome by the TNBoF variant, which uses two separate codebooks to capture short-term and long-term dependencies. By applying our proposed attention mechanism, the performance of both the NBoF and TNBoF models is further boosted.

The third column of Table I shows the performances of all models when using two additional convolution layers as local feature extractors prior to the layer of interest. It is clear that all of the models benefit from the additional convolution layers, especially the NBoF model and its variants. In this setting, the GRU models no longer dominate the family of NBoF models; in fact, they become the worst-performing models in the third column of Table I. Furthermore, both codeword attention (NBoF-CA, TNBoF-CA) and temporal attention (NBoF-TA, TNBoF-TA) consistently enhance the baselines' performance, making the attention-based models the best-performing ones.

Here we should note that although the baseline models (NBoF, TNBoF) use the adaptive scaling step proposed in [20] to improve training stability, we did not employ this step in the attention-based models. The reason stems from the fact that adaptive scaling introduces additional degrees of freedom to the quantization step, which counteracts the constraining effect of the attention mechanism. Table II shows the performances of the attention-based models on the FI2010 dataset, with and without the adaptive scaling step proposed in [20]. In most cases, the adaptive scaling step slightly degrades the performance of the attention-based models. As we will see in the next subsection, this effect is more noticeable on the audio datasets.

Models adaptive scale no adaptive scale
Prediction Horizon
NBoF-CA
NBoF-TA
TNBoF-CA
TNBoF-TA
Prediction Horizon
NBoF-CA
NBoF-TA
TNBoF-CA
TNBoF-TA
Prediction Horizon
NBoF-CA
NBoF-TA
TNBoF-CA
TNBoF-TA

TABLE II: Performances (averaged F1 in %) of attention-based models on the FI2010 dataset, with and without the adaptive scaling step proposed in [20]. The best results in each row are highlighted in bold-face

IV-B Audio Analysis Experiments

Models without conv with conv
FMA Dataset

GRU [3]
NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)
TUT-UAS2018 Dataset
GRU [3]
NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)

TABLE III: Audio analysis results of FMA and TUT-UAS2018 datasets. The second column shows performances (test accuracy in %) of all models without using any convolution layer as preprocessing layers. The third column shows the corresponding performances (test accuracy in %) when additional convolution layers were used as preprocessing layers. The best results in each column are highlighted in bold-face.

Audio recordings are an important type of sequence data. In this subsection, we present our empirical analysis using two audio datasets, representing two different applications in audio signal analysis: music genre recognition and acoustic scene classification.

In the first application, the objective is to train an acoustic system that recognizes the genre of a short musical recording. For this purpose, we conducted experiments using the small subset of the FMA dataset [7], which contains tracks from the 8 most popular genres: pop, instrumental, experimental, folk, rock, international, electronic, and hip-hop. Each short audio clip is transformed to a Mel-spectrogram representation with a fixed number of frequency bands, computed with an overlapping analysis window. This preprocessing step yields a fixed-size input sequence for each clip.
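
As an illustration of this preprocessing step, a Mel-spectrogram can be computed with librosa as sketched below. The file name and the n_fft, hop_length, and n_mels values are placeholders rather than the settings used in the paper, and the logarithmic scaling is a common convention that we assume here.

    import librosa

    # Sketch of the Mel-spectrogram preprocessing (placeholder parameters).
    y, sr = librosa.load("clip.mp3", sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024, n_mels=64)
    log_mel = librosa.power_to_db(mel)   # (n_mels, n_frames) sequence fed to the models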

In the second application, the objective is to train an acoustic system that can classify the type of environment based on its surrounding sounds. For this application, we used the TUT-UAS2018 dataset [16], which contains audio clips recorded in 10 urban acoustic scenes: airport, shopping_mall, metro_station, street_pedestrian, public_square, street_traffic, tram, bus, metro, and park. Similar to the FMA dataset, we transformed each audio clip to a Mel-spectrogram representation with an overlapping analysis window, resulting in a fixed-size input sequence.

For both applications, we report the test accuracy as the performance metric. Experiment results on the FMA and TUT-UAS2018 datasets are shown in Table III. Performances of all models without and with convolution layers for feature extraction are presented in the second and third columns, respectively.

Models adaptive scale no adaptive scale
FMA Dataset
NBoF-CA
NBoF-TA
TNBoF-CA
TNBoF-TA
TUT-UAS2018 Dataset
NBoF-CA
NBoF-TA
TNBoF-CA
TNBoF-TA

TABLE IV: Performances (test accuracy in %) of attention-based models on FMA and TUT-UAS2018 datasets, with and without the adaptive scaling step proposed in [20]. The best results in each row are highlighted in bold-face.

On the FMA dataset, we can easily observe significant improvements in all models when using additional convolution layers. Without any convolution layers, the NBoF and TNBoF models outperform the GRU model on average, although with larger variances. The order is reversed when additional convolution layers are used: the GRU model benefits greatly from the preprocessing layers, outperforming the NBoF and TNBoF models. In both scenarios, i.e., with or without convolution layers, the proposed attention block greatly enhances the baseline NBoF and TNBoF models, making them the best-performing models in this task.

Models test accuracy
Noisy FMA Dataset

GRU [3]
NBoF [20]
NBoF-IA (our)
TNBoF [19]
TNBoF-IA (our)
Noisy TUT-UAS2018 Dataset
GRU [3]
NBoF [20]
NBoF-IA (our)
TNBoF [19]
TNBoF-IA (our)

TABLE V: Audio analysis results under noisy data setting. No preprocessing convolution layer was used in this setting.

On the TUT-UAS2018 dataset, while adding convolution layers leads to noticeable improvements for the baseline NBoF and TNBoF models, we observe no similar improvement for the GRU model. Similar to the FMA dataset, we observe a consistent performance boost on the TUT-UAS2018 dataset by incorporating the proposed attention block into the NBoF and TNBoF models.

Similar to Section IV-A, we also conducted experiments on the FMA and TUT-UAS2018 datasets to analyze the effect of the adaptive scaling step proposed in [20]. The results are shown in Table IV. The results obtained from both audio analysis tasks in Table IV are consistent with what we observed in the stock movement prediction task in Table II: although the adaptive scaling step can enhance the NBoF and TNBoF models, as demonstrated in [20], the additional degrees of freedom introduced by this step negate the competition effect enforced by the attention mechanism, leading to performance degradation when the two methods are combined.

In order to evaluate how well the proposed input attention mechanism (NBoF-IA, TNBoF-IA) tackles noisy data, we simulated contaminated audio data by adding synthetic frequency bands, which are generated by adding white noise to the averaged Mel coefficients. Here we should note that in this set of experiments we did not use any convolution layers, in order to gauge how resilient the layers of interest are to noise. The results are shown in Table V. As can be seen from Table V, when moving from the noiseless to the noisy version of the FMA and TUT-UAS2018 datasets, the accuracy of the NBoF and TNBoF models dropped significantly. The GRU models exhibited similar behavior, although their performance drops are less significant than those of the NBoF and TNBoF models. By incorporating the proposed input attention block into the NBoF and TNBoF models, we were able to achieve performance very close to the noiseless scenario.
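
A minimal NumPy sketch of this contamination procedure is given below; the number of extra bands and the noise standard deviation are illustrative values, since the exact settings are not reproduced here.

    import numpy as np

    def add_noisy_bands(mel, n_extra=8, noise_std=1.0, seed=0):
        """Append synthetic frequency bands to a (n_bands, n_frames) Mel-spectrogram.
        Each extra band is the per-frame average of the original bands plus white noise."""
        rng = np.random.default_rng(seed)
        avg = mel.mean(axis=0, keepdims=True)                        # (1, n_frames)
        noise = rng.normal(0.0, noise_std, size=(n_extra, mel.shape[1]))
        return np.concatenate([mel, avg + noise], axis=0)            # (n_bands + n_extra, n_frames)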

IV-C Medical Diagnosis Experiments

Models F1
GRU [3]
NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)

TABLE VI: Performance (averaged F1) on AF dataset

Medical diagnosis, which plays a crucial role in ensuring human well-being, is inherently an intricate process. The quality of the diagnosis is highly dependent on the expertise of the examiner. Since it takes several years and a great amount of resources to train human experts, medical diagnosis tools have been actively developed over the past decades to assist human examiners. In our empirical analysis using medical data, we investigated the effectiveness of the proposed models in diagnosing cardiovascular diseases using publicly available Electrocardiogram (ECG) and Phonocardiogram (PCG) signals.

The AF dataset focuses on the problem of atrial fibrillation detection from ECG recordings, which are provided as the development data (training set) in the Physionet/Computing in Cardiology Challenge 2017 [5]. The dataset contains single-lead ECG recordings of varying duration. The objective of the challenge was to classify a given recording into one of four classes: normal sinus rhythm, atrial fibrillation, alternative rhythm, and noise. We followed an experimental setup similar to [1], which evaluates a given model using 5-fold cross-validation. Additionally, the recordings were clipped or padded to a constant length. Since a single-lead ECG recording is only a univariate sequence, it is necessary to use convolution layers as preprocessing layers to extract higher-level features before the NBoF or GRU layers. To tackle the imbalanced nature of the training set, we scaled the loss term associated with each class by a factor inversely proportional to the number of samples in that class.
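
Such inverse-frequency loss weighting can be sketched as follows; the per-class sample counts below are hypothetical and only illustrate the scaling rule.

    import torch
    import torch.nn as nn

    # Hypothetical per-class sample counts for the four AF classes
    # (normal, atrial fibrillation, alternative rhythm, noise).
    class_counts = torch.tensor([5000.0, 750.0, 2500.0, 300.0])
    class_weights = class_counts.sum() / (len(class_counts) * class_counts)
    criterion = nn.CrossEntropyLoss(weight=class_weights)   # loss scaled inversely to class frequency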

PCG signals are often used in ambulatory diagnosis to evaluate the hemodynamic status of the heart and detect potential cardiovascular problems. The data used in our experiments come from the training set provided in the Physionet/Computing in Cardiology Challenge 2016 [6]. The objective of the challenge is to develop an automatic classification method for anomaly (normal versus abnormal) and quality (good versus bad) detection given a PCG recording.

Since the length of the recordings varies greatly, we generated fixed-length segments from the recordings for training the models; during the test phase, the models were used to classify overlapping sub-segments of a given recording, and the overall label is inferred from the averaged classification of the sub-segments. The PCG signal captures the acoustic nature of the heart sound; thus, we extracted a Mel-spectrogram representation of each segment using an overlapping analysis window, as in the audio experiments. Since this dataset is smaller than the AF dataset, we only employed a 3-fold cross-validation protocol for this problem. Further details regarding our experimental setup on the AF and PCG datasets are provided in the Appendix.
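
The segment-level aggregation at test time can be sketched as follows, assuming a trained classifier model and a tensor of extracted sub-segments prepared elsewhere.

    import torch

    def classify_recording(model, segments):
        """Average the predicted class probabilities over the overlapping sub-segments
        of one recording (shape (n_segments, ...)) and return the overall label."""
        model.eval()
        with torch.no_grad():
            probs = torch.softmax(model(segments), dim=1)   # (n_segments, n_classes)
        return probs.mean(dim=0).argmax().item()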

Models Anomaly Detection Quality Detection

GRU [3]


NBoF [20]
NBoF-CA (our)
NBoF-TA (our)
TNBoF [19]
TNBoF-CA (our)
TNBoF-TA (our)

TABLE VII: Performance (mean of sensitivity and specificity) on PCG dataset. The higher, the better.

Table VI shows the averaged F1 score, the metric adopted by the challenge [5], of all models on the AF dataset. In Table VII, we show the anomaly and quality detection performance. The performance metric used by the challenge [6] is calculated as the mean of the sensitivity and specificity scores. On the AF dataset, the averaged F1 scores obtained from the baseline NBoF and TNBoF models are significantly higher than that obtained from the recurrent model. Although the improvement margins for the NBoF model are minor in Table VI, both the NBoF and TNBoF models enjoy increases in performance when using the proposed attention blocks. The consistent performance gain produced by the attention blocks can also be observed on the PCG dataset in Table VII. On this dataset, while the NBoF and TNBoF models score far below the GRU model, the attention-based models perform nearly as well as the recurrent model.

V Conclusions

In this paper, we proposed 2D-Attention, a generic attention mechanism for data represented in the form of matrices. The proposed attention computation can be used in a plug-and-play manner and can be updated jointly with other components in a computation graph. Using the proposed attention block, we further proposed three variants of the Neural Bag-of-Features model for learning with sequence data. Our extensive experiments in financial forecasting, audio analysis, and medical diagnosis demonstrated that the proposed attention block consistently leads to performance gains for the Neural Bag-of-Features models. Since 2D-Attention is a generic attention computation for matrices, investigating its efficacy in other neural network models is an interesting direction for future work.

VI Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). This publication reflects the authors’ views only. The European Commission is not responsible for any use that may be made of the information it contains.

Appendix

In all of our experiments, we used the ADAM optimizer for stochastic optimization. Weight decay or a max-norm constraint was used for regularization. In addition, dropout was applied to the output of the layer preceding the classification layer. In all models, a fully-connected layer is placed before the output layer. For the NBoF, TNBoF, and attention-based models, we used the same number of codewords in the quantization layer, and the number of units in the GRU model was set correspondingly. Details that are specific to each experiment are provided below:

  • Financial Forecasting Experiments: All models were trained for a fixed number of epochs, with the initial learning rate decreased twice during training. We followed [31] and scaled the loss term associated with each class by a factor inversely proportional to the number of samples of that class to counter the effect of class imbalance. In experiments that used convolution layers as preprocessing layers, we used two 1D convolution layers; batch normalization was applied after each convolution layer, followed by the ReLU activation.

  • Audio Analysis Experiments: The setup is similar to the financial forecasting experiments, except for the configuration of the convolution layers: four 1D convolution layers were used, with a max-pooling layer after the first two convolution layers to reduce the temporal dimension by half. After each convolution layer, we applied batch normalization, followed by the ReLU activation (a sketch of such a convolution-batch-normalization-ReLU stage is given after this list).

  • Medical Diagnosis Experiments: in both the AF and PCG datasets, all models were trained with the initial learning rate decreased twice during training. For the AF dataset, we adopted the convolution architecture proposed in [1] as the first computation block in all models. For the PCG dataset, we used five strided 1D convolution layers as the preprocessing layers. After each convolution layer, we applied batch normalization, followed by the ReLU activation.
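
As referenced in the audio analysis item above, a convolution-batch-normalization-ReLU preprocessing stage of this kind can be sketched as follows; the kernel sizes and channel counts are placeholders rather than the configurations used in the experiments.

    import torch.nn as nn

    def conv_block(in_ch, out_ch, kernel_size=5):
        """One 1D convolution -> batch normalization -> ReLU stage (placeholder settings)."""
        return nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
        )

    # Illustrative four-layer stack in the spirit of the audio experiments:
    # two blocks, a max-pooling layer that halves the temporal dimension, two more blocks.
    preprocess = nn.Sequential(
        conv_block(40, 32), conv_block(32, 32),
        nn.MaxPool1d(kernel_size=2),
        conv_block(32, 64), conv_block(64, 64),
    )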

References

  • [1] F. Andreotti, O. Carr, M. A. Pimentel, A. Mahdi, and M. De Vos (2017) Comparing feature-based classifiers and convolutional neural networks to detect arrhythmia from short segments of ecg. In 2017 Computing in Cardiology (CinC), pp. 1–4. Cited by: 3rd item, §IV-C.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §I, §II, §III-A.
  • [3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §I, TABLE I, TABLE III, TABLE V, TABLE VI, TABLE VII, §IV.
  • [4] Y. G. Cinar, H. Mirisaee, P. Goswami, E. Gaussier, A. Aït-Bachir, and V. Strijov (2017) Position-based content attention for time series forecasting with sequence-to-sequence rnns. In International Conference on Neural Information Processing, pp. 533–544. Cited by: §II.
  • [5] G. D. Clifford, C. Liu, B. Moody, H. L. Li-wei, I. Silva, Q. Li, A. Johnson, and R. G. Mark (2017) AF classification from a short single lead ecg recording: the physionet/computing in cardiology challenge 2017. In 2017 Computing in Cardiology (CinC), pp. 1–4. Cited by: §IV-C, §IV-C.
  • [6] G. D. Clifford, C. Liu, B. Moody, D. Springer, I. Silva, Q. Li, and R. G. Mark (2016) Classification of normal/abnormal heart sound recordings: the physionet/computing in cardiology challenge 2016. In 2016 Computing in Cardiology Conference (CinC), pp. 609–612. Cited by: §IV-C, §IV-C.
  • [7] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson (2016) Fma: a dataset for music analysis. arXiv preprint arXiv:1612.01840. Cited by: §IV-B.
  • [8] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §I.
  • [9] A. Iosifidis, A. Tefas, and I. Pitas (2012) Multidimensional sequence classification based on fuzzy distances and discriminant analysis. IEEE Transactions on Knowledge and Data Engineering 25 (11), pp. 2564–2575. Cited by: §I.
  • [10] A. Iosifidis, A. Tefas, and I. Pitas (2014) Discriminant bag of words based representation for human action recognition. Pattern Recognition Letters 49, pp. 185–192. Cited by: §I.
  • [11] Y. Jiang, C. Ngo, and J. Yang (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM international conference on Image and video retrieval, pp. 494–501. Cited by: §I.
  • [12] F. Laakom, N. Passalis, J. Raitoharju, J. Nikkanen, A. Tefas, A. Iosifidis, and M. Gabbouj (2019) Bag of color features for color constancy. arXiv preprint arXiv:1906.04445. Cited by: §I, §II.
  • [13] S. Lazebnik, C. Schmid, and J. Ponce (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 2169–2178. Cited by: §I.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §I.
  • [15] D. G. Lowe (1999) Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, Vol. 2, pp. 1150–1157. Cited by: §I.
  • [16] Cited by: §IV-B.
  • [17] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212. Cited by: §I, §I.
  • [18] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2018) Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. Journal of Forecasting 37 (8), pp. 852–866. Cited by: §I, §IV-A, §IV-A.
  • [19] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2018) Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data. IEEE Transactions on Emerging Topics in Computational Intelligence. Cited by: §II, TABLE I, TABLE III, TABLE V, TABLE VI, TABLE VII, §IV.
  • [20] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2019) Temporal logistic neural bag-of-features for financial time series forecasting leveraging limit order book data. arXiv preprint arXiv:1901.08280. Cited by: §II, §II, §IV-A, §IV-B, TABLE I, TABLE II, TABLE III, TABLE IV, TABLE V, TABLE VI, TABLE VII, §IV.
  • [21] N. Passalis and A. Tefas (2016) Entropy optimized feature-based bag-of-words representation for information retrieval. IEEE Transactions on Knowledge and Data Engineering 28 (7), pp. 1664–1677. Cited by: §I.
  • [22] N. Passalis and A. Tefas (2017) Neural bag-of-features learning. Pattern Recognition 64, pp. 277–294. Cited by: §I, §I, §II.
  • [23] N. Passalis, A. Tsantekidis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Time-series classification using neural bag-of-features. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 301–305. Cited by: §I, §I.
  • [24] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell (2017) A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971. Cited by: §I, §II, §III-A, §III-B.
  • [25] M. Riley, E. Heinen, and J. Ghosh (2008) A text retrieval approach to content-based audio retrieval. In Int. Symp. on Music Information Retrieval (ISMIR), pp. 295–300. Cited by: §I.
  • [26] O. B. Sezer, M. U. Gudelek, and A. M. Ozbayoglu (2020) Financial time series forecasting with deep learning: a systematic literature review: 2005–2019. Applied Soft Computing 90, pp. 106181. Cited by: §IV-A.
  • [27] E. Slutzky (1937) The summation of random causes as the source of cyclic processes. Econometrica: Journal of the Econometric Society, pp. 105–146. Cited by: §I.
  • [28] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley (2015) Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia 17 (10), pp. 1733–1746. Cited by: §I.
  • [29] G. C. Tiao and G. E. Box (1981) Modeling multiple time series with applications. journal of the American Statistical Association 76 (376), pp. 802–816. Cited by: §I.
  • [30] D. T. Tran, M. Gabbouj, and A. Iosifidis (2017) Multilinear class-specific discriminant analysis. Pattern Recognition Letters 100, pp. 131–136. Cited by: §I.
  • [31] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj (2018) Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE transactions on neural networks and learning systems 30 (5), pp. 1407–1418. Cited by: 1st item, §I, §I, §I, §II, §III-B, §IV-A.
  • [32] D. T. Tran, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis (2017) Tensor representation in high-frequency financial data for price change prediction. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. Cited by: §I, §I, §IV-A.
  • [33] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §I, §I, §III-A.
  • [34] B. Zhao, H. Lu, S. Chen, J. Liu, and D. Wu (2017) Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics 28 (1), pp. 162–169. Cited by: §I.