Multi-Resolution LSTM for Long-Term Prediction in Neural Activity Video

05/08/2017, by Yilin Song et al.

Epileptic seizures are caused by abnormal, overly synchronized, electrical activity in the brain. The abnormal electrical activity manifests as waves, propagating across the brain. Accurate prediction of the propagation velocity and direction of these waves could enable real-time responsive brain stimulation to suppress or prevent the seizures entirely. However, this problem is very challenging because the algorithm must be able to predict the neural signals over a sufficiently long time horizon to allow enough time for medical intervention. We consider how to accomplish long-term prediction using an LSTM network. To alleviate the vanishing gradient problem, we propose two encoder-decoder-predictor structures, both using multi-resolution representations. The novel LSTM structure with multi-resolution layers significantly outperforms the single-resolution benchmark with a similar number of parameters. To overcome the blurring effect associated with video prediction in the pixel domain using the standard mean square error (MSE) loss, we use energy-based adversarial training to improve long-term prediction. We demonstrate and analyze how a discriminative model with an encoder-decoder structure using a 3D CNN improves long-term prediction.


I Introduction

Studies have focused on seizure prediction for decades, but reliable prediction of seizure activity many minutes before a seizure has been elusive. Most current neurological analyses apply hand-constructed features such as wavelets, spike energy, and spectral power [4, 19, 7, 8, 22, 6, 28, 32, 1] to electroencephalogram (EEG) or electrocorticographic (ECoG) data at coarse resolution. Prior work has focused on predicting seizures minutes or hours in advance, using supervised datasets with labeled examples of seizures. However, with the rich spatial and temporal patterns unveiled by high-resolution micro-electrocorticography (µECoG) [35], which is very similar to a high-frame-rate video signal, accurate prediction of neural activity at the sub-second level could become a very interesting and tractable problem. Neural signal prediction on this time frame would allow responsive stimulation to suppress seizures. This kind of neural signal prediction could also yield a compact representation of neural activity, which could in turn lead to understanding non-pathologic neural activity. To capture the highly non-linear dynamics in neural activity, deep neural networks appear to be a promising solution. However, learning a compact representation for neural video prediction becomes more challenging as the prediction horizon grows, because the deep neural network suffers a more severe vanishing gradient problem.

To model long-term dependencies, Long Short-Term Memory (LSTM) units [14] were proposed as an improvement over the vanilla RNN, addressing the vanishing gradient problem by introducing gate functions. The Gated Recurrent Unit (GRU) [5], a simplified version of the LSTM unit, has achieved better performance in a number of applications [16, 33, 3]. Even though LSTM and GRU try to solve the vanishing gradient problem by preserving long-term dependencies in their cells, modeling long-term dependencies remains difficult. For neural language translation, instead of decoding a sequence from a compact feature learned through an encoding network, [3, 25] use a word-by-word attention mechanism that allows direct connections between premise and hypothesis sentences. Such direct connections alleviate the vanishing gradient problem for long-sentence translation. A different approach is to use memory-augmented neural networks such as Neural Turing Machines [10]: by storing information in an external memory, the explicit storage of hidden states creates a shortcut through time. [27, 11] both achieve good performance using external memory networks.

Figure 1: Video prediction framework. The generative model is built using convolutional LSTM [36]. The network flow is represented with solid arrows, whereas the losses for the generative model are represented with dashed arrows.

Because convolutions account for short-range dependencies, CNN-based models have to either increase the depth of the model or use a larger receptive field and larger stride to capture long-term correlation. Either way is generally considered suboptimal for time series prediction. Work in [30, 13] creates efficient information flow from lower layers to higher layers by using residual modules and skip connections. [23] used residual modules between depth layers in both CNN- and RNN-based image pixel prediction. Because time series such as audio and video have high temporal correlation, [34] used dilated convolution kernels to increase the receptive field with the same number of parameters, and achieved good performance. Inspired by the dilated convolutions of [34], we propose an LSTM network that uses multi-resolution layers. The higher layer skips temporal connections to create a shortcut, while the lower layer is temporally connected and preserves fine-grained information. We also experiment with an explicitly multi-resolution LSTM structure that resembles a temporal pyramid. We demonstrate that both multi-resolution representations improve long-term prediction compared to a benchmark LSTM.

Learning long-term dependencies requires not only an appropriate network structure but also a suitable loss function. For video prediction, to overcome the blurry predictions caused by the pixel-wise mean square error (MSE), [21] added a total variation loss. Video prediction can be considered a special case of domain transfer, where the past observed frames lie on one data manifold and the future frames lie on another. Adversarial training finds the relationship between these two manifolds. [20, 21] add an adversarial loss on top of the MSE, but how video prediction benefits from adversarial training is not fully understood. To further understand how adversarial training benefits video prediction, we use an encoder-decoder 3D CNN structure for the discriminative model. The discriminative model uses reconstruction error as its loss rather than a KL-divergence measure; this resembles the energy-based GAN of [37] versus the original GAN [9].

II Framework

In this section, we describe the general structure of our neural video prediction model. The structure consists of two different models, a generative model and a discriminative model. The generative model first takes past observations of the video sequence as input and learns a compact feature representation, from which it then reconstructs the past frames and predicts future frames. We explore different model structures in the experiments, and use the convolutional LSTM [36] as the basic building block for the generative model. The discriminative model structure is nearly the same in all experiments. Its main goal is to determine whether the future frames are generated conditioned on the true past frames. Together, the two models are trained adversarially [9]. The general structure is shown in Fig. 1.

II-A Generative Model

Let $X = \{x_t\}$ denote a video sequence, where $x_t$ denotes the current observation and $x_{t+k}$ denotes the $k$-th frame in the future to predict. The generative network takes the observed frames $\{x_{t-T+1}, \dots, x_t\}$ as input and outputs a sequence of reconstructed past frames $\{\hat{x}_{t-T+1}, \dots, \hat{x}_t\}$ and predicted future frames $\{\hat{x}_{t+1}, \dots, \hat{x}_{t+K}\}$.

In our approach the generative model has an encoder network, a decoder network and a predictor network, similar to [29]. These networks all use the convolutional LSTM [36] as the basic computation module. The encoder network takes the observed frames as input and generates a representation $h_t$. The decoder network reconstructs the past frames from $h_t$. The decoder LSTM is a conditional model, namely it reconstructs each past frame conditioned on the previously reconstructed frame. The predictor network generates the future frames from $h_t$; it is also a conditional model, conditioning on the previously predicted frame to predict the next one. The loss for the generative model is a weighted combination of pixel-domain reconstruction and prediction losses and an adversarial loss:

$\mathcal{L}_G = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{pred}\,\mathcal{L}_{pred} + \lambda_{adv}\,\mathcal{L}_{adv}$   (1)

$Z$ is a four-dimensional tensor of size $T \times C \times H \times W$, where $C$, $H$ and $W$ represent the channel, height and width of the frame respectively. $Z$ is constructed by stacking the observed past frames and the predicted future frames in time order. $\mathcal{L}_{rec}$ and $\mathcal{L}_{pred}$ are the pixel-domain losses for the reconstructed frames and predicted frames respectively, whereas $\mathcal{L}_{adv}$ is the adversarial loss from the discriminative model.
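For concreteness, here is a minimal PyTorch sketch of how the generative loss of Eq. 1 might be assembled; the function and argument names and the weight values are illustrative assumptions, not the paper's:

```python
import torch.nn.functional as F

def generator_loss(x_rec, x_past, x_pred, x_future, adv_energy,
                   w_rec=1.0, w_pred=1.0, w_adv=0.02):
    """Weighted sum of the terms in Eq. 1 (weight values are placeholders)."""
    l_rec = F.mse_loss(x_rec, x_past)       # pixel-domain loss, reconstructed past
    l_pred = F.mse_loss(x_pred, x_future)   # pixel-domain loss, predicted future
    # adv_energy is the discriminator's reconstruction error on
    # Z = [true past, predicted future]; low energy means a realistic sequence.
    return w_rec * l_rec + w_pred * l_pred + w_adv * adv_energy
```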

Figure 2: Understanding the benefit of adversarial training. The input to the discriminative model is either the true history with the true future or the true history with the predicted future. The third and fifth rows of each example show the activations of the second-to-last layer output across all channels. The activation produced by true data is distributed almost evenly in both space and time to reconstruct the entire sequence. The activation produced by the sequence with the predicted future, however, concentrates on spatial and temporal inconsistencies. For example, in the first sequence, the discriminative model finds the inconsistency in the last few frames.

II-B Discriminative Model

(a) benchmark network
(b) multi-resolution LSTM
(c) LSTM with multi-resolution layers
Figure 3: Benchmark network, multi-resolution LSTM, and LSTM with multi-resolution layers. The multi-resolution LSTM has two scales, and each scale has a two-layer structure; only one layer is drawn per scale for simplicity. The dotted box represents the predicted frames, and $\tilde{x}$ represents linearly interpolated frames.

Generative adversarial networks (GAN) were introduced by [9], where images are generated from random noise using two networks trained in a competing manner. The discriminative model in [9] minimizes the KL-divergence between the true image distribution and the generated image distribution. The original GAN structure suffers from convergence problems and mode collapse [26, 2]. To address these problems, [26] introduced several techniques including feature matching, minibatch discrimination and historical averaging. [37] uses an image reconstruction loss instead of the KL-divergence loss. [24] derives a stable deep convolutional GAN structure by modifying modules in both the generator and the discriminator. [2] shows that the manifolds of the true data distribution and the generated data distribution in high-dimensional space hardly have any overlap, and that the Wasserstein distance is a better measure than other distances for non-overlapping distributions; [2] achieved state-of-the-art performance for image generation.

Adversarial training also applies to the domain transfer problem, where one generates a sample in a target domain conditioned on data in a source domain. Unlike in GANs, whose generative models can easily suffer from mode collapse [26, 24], in domain transfer overlaps between the source domain and target domain manifolds are easier to find. [18, 31, 20, 21] all used modest model structures to perform domain transfer tasks and achieved good performance. Video prediction can be considered a domain transfer problem, where the embedding of the past frames lies on one manifold and the embedding of the future lies on another. [20] concatenates the LSTM features of the past frames and the CNN features of the generated frame to train a separate multilayer perceptron. [21] uses a multi-scale 2D convolutional network: the discriminative model stacks all input frames in the channel dimension and outputs a single scalar indicating whether the video frames are generated or come from the ground truth future. However, both networks fail to model the temporal correlation between frames explicitly.

More importantly, it is not fully understood how adversarial training benefits video prediction. To exploit the temporal dependencies, we use an encoder-decoder 3D CNN structure as our discriminative model. The discriminative model uses the energy as its loss function; the energy-based model finds a compact representation for sequences that live on a low-dimensional manifold. [37] demonstrates that energy-based GAN training has advantages over the original GAN for image generation. Another benefit of the encoder-decoder structure is that, by mapping the activations back into pixel space, it helps us understand how adversarial training benefits video prediction. Figure 2 shows the activations in the second-to-last layer of the discriminative model for different inputs. The loss for the discriminative model is:

$\mathcal{L}_D = \|\mathrm{Dec}(\mathrm{Enc}(Z)) - Z\| + \max\big(0,\; m - \|\mathrm{Dec}(\mathrm{Enc}(\hat{Z})) - \hat{Z}\|\big)$   (2)

The Enc and Dec in Eq. 2 refer to the encoder and decoder in the discriminative model; $Z$ stacks the true history with the true future, $\hat{Z}$ stacks the true history with the predicted future, and $m$ is a positive margin.
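A minimal sketch of Eq. 2, assuming the discriminative model is exposed as a single autoencoder module and the margin value is a free hyperparameter:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_autoencoder, z_true, z_fake, margin=1.0):
    """Energy-based discriminator loss (Eq. 2): drive the energy of true
    sequences down and push the energy of fake sequences above a margin."""
    e_true = F.mse_loss(d_autoencoder(z_true), z_true)  # [true past, true future]
    e_fake = F.mse_loss(d_autoencoder(z_fake), z_fake)  # [true past, predicted future]
    return e_true + torch.clamp(margin - e_fake, min=0.0)
```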

III Benchmark and Multi-Resolution Network

For neural video prediction, to capture long-term dependencies, we propose two different network structures: the multi-resolution LSTM and the LSTM with multi-resolution layers. For all generative models, the basic building block is the ConvLSTM [36]. Each ConvLSTM layer at each time $t$ takes $X_t$ as input, and has memory cell state $C_t$, hidden state $H_t$ and gates $i_t$, $f_t$, $o_t$. The equations we use for the ConvLSTM are shown in Eq. 3, where $*$ denotes the convolution operator and $\circ$ denotes the Hadamard product. All generative models have an encoder, a decoder and a predictor.

$i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)$
$f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)$
$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)$
$o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)$
$H_t = o_t \circ \tanh(C_t)$   (3)
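A sketch of one ConvLSTM step in PyTorch; for brevity it folds the four gate convolutions into one and omits the peephole (Hadamard) terms of Eq. 3, so it is a simplified variant rather than the exact cell, and the kernel size is an assumption:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Single ConvLSTM step in the spirit of Eq. 3 / [36]."""

    def __init__(self, in_ch, hid_ch, kernel=5):
        super().__init__()
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden state H_{t-1}, memory cell C_{t-1}
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)  # C_t
        h = torch.sigmoid(o) * torch.tanh(c)                         # H_t
        return h, c
```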

III-A Benchmark Network

First we introduce the benchmark ConvLSTM model, a two-layer ConvLSTM structure shown in Fig. 3(a). In the benchmark model, each convolutional LSTM layer uses the same fixed-size convolution kernel. For the predictor and decoder ConvLSTMs in the model, the outputs of both layers go through a deconvolution layer. The deconvolution layer uses a $1 \times 1$ kernel and outputs a frame, which is essentially a weighted average of all input feature maps followed by a nonlinear function.

III-B Multi-Resolution LSTM

In this approach, in general, we generate $S$ temporal scales of the training sequences. The original sequence constitutes scale 0, and each upper scale is recursively down-sampled from the scale below by a factor of 2. The top scale (coarsest resolution) works in the same way as the benchmark network over its samples only. Each lower scale considers both the samples in that scale and the interpolated samples from the scale above. We only present the 2-scale case here for simplicity. In order to avoid delay, we use simple averaging of the current and the previous sample in the lower scale as the anti-aliasing filter for downsampling. Specifically, let $x_t^0$ represent the true video frame at time $t$. The scale-1 signal at even time samples is produced by:

$x_{2t}^{1} = \frac{1}{2}\big(x_{2t}^{0} + x_{2t-1}^{0}\big)$   (4)

To interpolate the odd samples at scale 1 from the even samples, we use a simple averaging interpolation filter. The interpolated signal from scale 1 is $\tilde{x}_{2t+1}^{1} = \frac{1}{2}\big(\hat{x}_{2t}^{1} + \hat{x}_{2t+2}^{1}\big)$.
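In code, the downsampling of Eq. 4 and the averaging interpolation reduce to two small tensor operations; a sketch assuming frames are stored as a (T, C, H, W) tensor with T even:

```python
import torch

def downsample_scale1(x0):
    """Eq. 4: the scale-1 sample at time 2t averages frames 2t and 2t-1
    at scale 0, so no future frame is needed (no delay)."""
    return 0.5 * (x0[0::2] + x0[1::2])  # pairs (x_{2t-1}, x_{2t}) -> one frame

def interpolate_scale1(x1_even):
    """Fill in the odd scale-1 samples by averaging neighboring even ones."""
    odd = 0.5 * (x1_even[:-1] + x1_even[1:])
    out = x1_even.new_empty((2 * x1_even.shape[0] - 1,) + tuple(x1_even.shape[1:]))
    out[0::2] = x1_even
    out[1::2] = odd
    return out
```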

As shown in Fig. 3(b), we first predict the even samples at scale 1. We then interpolate the odd future samples from the predicted even samples, to generate all predicted samples $\hat{x}^1$ from scale 1. We then predict samples at scale 0 using both the past samples at scale 0 and the current predicted sample at scale 1. Specifically, we predict the sample at time $t+1$ using the features learned up to time $t$ (from both scales), the actual or predicted sample at time $t$ at scale 0, as well as the predicted sample at time $t+1$ at scale 1, i.e., $\hat{x}_{t+1}^{1}$. The two inputs $x_t^0$ and $\hat{x}_{t+1}^{1}$ to the ConvLSTM predictor are simply stacked as two channels at the same time. In each scale $s$, the generative model is trained by minimizing the loss function at that scale:

$\mathcal{L}^{s} = \lambda_{rec}\,\mathcal{L}_{rec}^{s} + \lambda_{pred}\,\mathcal{L}_{pred}^{s} + \lambda_{adv}\,\mathcal{L}_{adv}^{s}$   (5)

where $Z^s$ is a four-dimensional tensor formed by stacking the true past frames and predicted frames for scale $s$ in time order. The illustration of the multi-scale structure is shown in Fig. 3(b). In each scale, the LSTM network has exactly the same two-layer ConvLSTM structure as the benchmark model, except the input frames of scale 0 have twice the number of channels compared to scale 1: half from the current scale and half from the scale above. A comparison of the 2-scale multi-resolution LSTM and single-scale prediction is shown in Fig. 4.

Figure 4: Comparison between the 2-scale and single-scale models for video prediction. For single-scale video prediction, the encoder, decoder and predictor in the generative model each use a two-layer convolutional LSTM. In 2-scale prediction, each scale has the same network structure as the single-scale benchmark. The single-scale model and the multi-resolution LSTM correspond to models 6 and 7 in Tab. I.

III-C LSTM with Multi-Resolution Layers

Figure 5: Demonstration of the high correlation of the first-layer output of the two-layer LSTM model at the original resolution. The model uses a two-layer convolutional LSTM, each layer with 128 convolutional kernels. Rows 1 and 3 show the encoder observations for two different sequences; rows 2 and 4 show the average of the 128 feature maps produced by layer 1.

For time series prediction, because the temporal correlation is high, [34] uses dilated convolution to achieve a larger receptive field with the same number of parameters: the convolutional filters in the higher layers of the CNN are structured with zero coefficients at every other connection.

Inspired by [34], we propose an LSTM network that has multi-resolution layers. The network has a higher layer and a lower layer. The fine-grained temporal resolution is preserved by the lower layer, shown in Fig. 5. The higher layer of the convolutional LSTM model uses a skip temporal connection, shown in Fig. 3(c). Compared to the lower layer, the higher layer creates a temporal highway, which alleviates the vanishing gradient problem. Different from the benchmark network shown in Fig. 3(a), the deconvolution layer in the multi-resolution layer network (Fig. 3(c)) uses different parameters for prediction. In our implementation, the deconvolution layer performs spatial convolution on the feature map outputs, and the increase in the number of parameters compared to Fig. 3(a) is almost negligible. The number of parameters used in the different models is shown in Tab. I.
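A structural sketch of the idea, reusing the ConvLSTMCell sketched above: layer 1 updates at every step, while layer 2 updates only on alternate steps. The every-other-step schedule and variable names are our reading of Fig. 3(c), not code from the paper:

```python
import torch.nn as nn

class MultiResLayerLSTM(nn.Module):
    """Two-layer LSTM with multi-resolution layers (Fig. 3(c)): the lower
    layer keeps fine-grained temporal resolution, the higher layer skips
    temporal connections to form a 'temporal highway'."""

    def __init__(self, in_ch, hid_ch=128):
        super().__init__()
        self.layer1 = ConvLSTMCell(in_ch, hid_ch)
        self.layer2 = ConvLSTMCell(hid_ch, hid_ch)

    def forward(self, frames, s1, s2):
        outputs = []
        for t, x in enumerate(frames):      # frames: iterable of (B, C, H, W)
            s1 = self.layer1(x, s1)
            if t % 2 == 0:                  # higher layer updates every other step
                s2 = self.layer2(s1[0], s2)
            outputs.append((s1[0], s2[0]))  # both layers feed the deconv head
        return outputs, s1, s2
```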

IV Experiment

IV-A Dataset

We analyzed ECoG data from an acute in vivo feline model of seizures. The 18-by-20 array of high-density active electrodes has 500 µm spacing between neighboring channels. The in vivo recording has a temporal sampling rate of 277.78 Hz and lasts 53 minutes, yielding a total of 894K frames: 788K consecutive training frames and 106K consecutive testing frames. During training, we use 16 frames as observation to predict the next 16 frames.
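The training pairs can be cut from the recording with a simple sliding window; a sketch assuming the frames arrive as a (T, 18, 20) array of consecutive, already-normalized samples (names here are illustrative):

```python
import torch
from torch.utils.data import Dataset

class ECoGWindows(Dataset):
    """Sliding windows of 16 observed frames followed by 16 target frames."""

    def __init__(self, frames, obs_len=16, pred_len=16):
        self.frames = torch.as_tensor(frames, dtype=torch.float32)
        self.obs_len, self.pred_len = obs_len, pred_len

    def __len__(self):
        return self.frames.shape[0] - self.obs_len - self.pred_len + 1

    def __getitem__(self, i):
        j = i + self.obs_len
        return self.frames[i:j], self.frames[j:j + self.pred_len]
```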

IV-B Results

Figure 6: PSNR of predicted frames versus prediction time. The benchmark model, the benchmark model with adversarial training, the multi-resolution LSTM, and the LSTM with multi-resolution layers correspond to models 5, 6, 7 and 8 respectively in Tab. I. The LSTM with multi-resolution layers has better long-term prediction accuracy than the other models. The PSNR is obtained by first computing the MSE, averaging squared errors over all pixels over all frames and all sequences, and then converting the resulting MSE to PSNR.

For the discriminative model, we use a 3D convolutional neural network with an encoder-decoder structure. The encoder uses three strided 3D convolution layers, all with the same convolutional kernel size; the decoder uses three strided 3D deconvolution layers with the same receptive field. We use batch normalization [15] and leaky ReLU [12] in all layers except the last. For the multi-resolution network, the discriminative model for the higher scale uses only two 3D convolution and deconvolution layers, since the sequence length is halved.
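A sketch of this discriminative model; the feature-map counts (32, 32, 4, 32, 32) follow Tab. I, while the kernel size and stride are assumptions since the exact values are not given above:

```python
import torch.nn as nn
import torch.nn.functional as F

def block3d(in_ch, out_ch, transpose=False, last=False):
    # Kernel size and stride are assumptions, not recoverable from the text.
    if transpose:
        conv = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                                  padding=1, output_padding=1)
    else:
        conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
    if last:
        return conv  # no batch norm / activation on the last layer
    return nn.Sequential(conv, nn.BatchNorm3d(out_ch), nn.LeakyReLU(0.2))

class Discriminator3D(nn.Module):
    """Encoder-decoder 3D CNN; its reconstruction error is the energy."""

    def __init__(self, in_ch=1):
        super().__init__()
        self.enc = nn.Sequential(block3d(in_ch, 32), block3d(32, 32),
                                 block3d(32, 4))
        self.dec = nn.Sequential(block3d(4, 32, transpose=True),
                                 block3d(32, 32, transpose=True),
                                 block3d(32, in_ch, transpose=True, last=True))

    def forward(self, z):  # z: (batch, channel, time, height, width)
        out = self.dec(self.enc(z))
        # Resize back to the input size: a convenience for odd spatial dims
        # such as the 18x20 array, not something stated in the paper.
        return F.interpolate(out, size=z.shape[2:], mode='trilinear',
                             align_corners=False)
```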

We present results for several models. The weights for the reconstruction error and prediction error are set to the same values in all models; for models using adversarial training, the adversarial weight $\lambda_{adv}$ is fixed as well. For the multi-resolution LSTM network, the weights are set the same at each scale. The discriminative model and the generative model are both trained using the Adam algorithm [17], each with a learning rate of 0.001 decreased by a factor of 10 halfway through training. To avoid exploding gradients in the generative model, we perform gradient clipping by capping the $\ell_2$ norm at 0.001. In all adversarial training cases, the discriminative model is updated once every two iterations.
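Putting the pieces together, a sketch of the training schedule just described, reusing the loss sketches above; gen(past) returning a (reconstruction, prediction) pair and (batch, time, channel, height, width) tensors are our assumed interface:

```python
import torch
import torch.nn.functional as F

def train(gen, disc, loader, total_iters, lr=1e-3):
    opt_g = torch.optim.Adam(gen.parameters(), lr=lr)
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr)
    for it, (past, future) in enumerate(loader):
        if it == total_iters // 2:          # drop both learning rates by 10x
            for opt in (opt_g, opt_d):
                for group in opt.param_groups:
                    group['lr'] = lr / 10
        x_rec, x_pred = gen(past)
        # Stack history and future in time order, then move channels first
        # for the 3D CNN discriminator.
        z_fake = torch.cat([past, x_pred], dim=1).transpose(1, 2)
        z_true = torch.cat([past, future], dim=1).transpose(1, 2)
        opt_g.zero_grad()
        adv_energy = F.mse_loss(disc(z_fake), z_fake)
        generator_loss(x_rec, past, x_pred, future, adv_energy).backward()
        torch.nn.utils.clip_grad_norm_(gen.parameters(), max_norm=0.001)
        opt_g.step()
        if it % 2 == 0:                     # discriminator steps every 2 iterations
            opt_d.zero_grad()
            discriminator_loss(disc, z_true, z_fake.detach()).backward()
            opt_d.step()
        if it + 1 >= total_iters:
            break
```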

During the testing stage, the observation sequence lasts 16 frames and the predictor generates 16 future frames based on the observed frames. To evaluate the performance of the different approaches, we compute the Peak Signal-to-Noise Ratio (PSNR) between the true future frames $x$ and the predicted future frames $\hat{x}$.
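The evaluation metric reduces to a few lines; a sketch assuming the signal is normalized so that the peak value is 1:

```python
import torch

def psnr(pred, target, peak=1.0):
    """Average squared error over all pixels, frames and sequences first,
    then convert the single MSE value to dB (as in Fig. 6)."""
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(peak ** 2 / mse)
```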

Sample prediction comparisons are shown in Tab. I. Adversarial training brings an improvement compared to using the pixel-domain loss alone. It is interesting to note that even though PSNR is based on the $\ell_2$ metric, adding the adversarial term to the loss function yields better prediction accuracy: the discriminative model helps the generative model learn long-term dependencies. The further the prediction, the more significant the accuracy gain from adversarial training, as shown in Fig. 6. Comparing all structures using adversarial learning, the LSTM with multi-resolution layers and the multi-resolution LSTM both show a significant gain over the benchmark model. Even though the multi-resolution LSTM achieves larger gains for prediction up to 10 frames ahead, the LSTM with multi-resolution layers takes over after 10 frames. The PSNR increase from the LSTM with multi-resolution layers reaches as high as 1.92 dB over the benchmark model at the longest prediction horizon. This is remarkable as the LSTM with multi-resolution layers has about the same number of parameters as the benchmark model. In Fig. 7, we show sample results of the different models.

| Model | Generative model | Generative model parameters | Discriminative feature maps per layer | PSNR of all frames |
|---|---|---|---|---|
| 1 | ConvLSTM 64-64 | 4123266 | None | 27.8737 |
| 2 | ConvLSTM 64-64 | 4123266 | 32,32,4,32,32 | 28.3426 |
| 3 | Multi-resolution LSTM 64-64, 64-64 | 4123266 and 8265732 | 32,4,32 and 32,32,4,32,32 | 28.8903 |
| 4 | LSTM with multi-resolution layer 64-64 | 4123524 | 32,32,4,32,32 | 29.0372 |
| 5 | ConvLSTM 128-128 | 15619330 | None | 27.9942 |
| 6 | ConvLSTM 128-128 | 15619330 | 32,32,4,32,32 | 28.5317 |
| 7 | Multi-resolution LSTM 128-128, 128-128 | 15619330 and 31277060 | 32,4,32 and 32,32,4,32,32 | 29.0931 |
| 8 | LSTM with multi-resolution layer 128-128 | 15619844 | 32,32,4,32,32 | 29.1741 |

Table I: Comparison of the accuracy and number of parameters of all models. 64-64 means the numbers of convolutional LSTM cells in layers 1 and 2 are both 64; all convolutional LSTM cells use the same kernel size. The multi-resolution LSTM structure has two scales, each with two layers of convolutional LSTM cells; 64-64, 64-64 means each scale uses a 64-64 two-layer LSTM. Models 1 and 5 are trained with the pixel-domain loss alone.

V Acknowledgement

This work was funded by National Science Foundation award CCF-1422914.

VI Conclusion

In this work, we have proposed two ways to perform video prediction using multi-resolution representations. The first approach uses a novel LSTM structure with multi-resolution layers for long-term video prediction; the network creates a temporal highway in the upper layer to capture the long-term dependencies between video frames. The second approach uses a two-scale multi-resolution LSTM. We compare the performance of these two approaches against a single-resolution benchmark model and demonstrate the advantage of multi-resolution LSTM representations. Both the multi-resolution LSTM and the LSTM with multi-resolution layers perform better than the single-resolution representation when all use adversarial training. The long-term prediction accuracy of the LSTM with multi-resolution layers is much higher than that of the benchmark models with a similar number of parameters. We also demonstrate that all models benefit from energy-based adversarial training, which is accomplished using a 3D CNN based encoder-decoder structure.

Figure 7: Prediction result comparison between the different methods: the generative model alone, adversarial training, the multi-resolution LSTM and the LSTM with multi-resolution layers correspond to models 5, 6, 7 and 8 respectively in Tab. I.

References