1 Introduction
Learning a formal grammar from continuous, unstructured data is a challenging problem. It is especially difficult when the elements (i.e., terminals) of the grammar to be learned are not symbolic or discrete (Chomsky, 1956, 1959), but are higher-dimensional vectors, such as representations extracted from real-world data streams (e.g., videos).
Simultaneously, addressing such challenges is necessary for better automated understanding of streaming data. In video understanding tasks such as activity detection, a space-time convolutional neural network (CNN) (e.g., Carreira & Zisserman (2017)) generates a representation abstracting local spatio-temporal information at every time step, forming a temporal sequence of representations. Learning a grammar that reflects sequential changes in video representations enables explicit, high-level modeling of temporal structure and of the relationships between multiple events occurring in videos. This not only allows for better recognition of human activities by enforcing the learned grammar on local-level detections, but also enables forecasting of future representations based on the learned production rules. It also provides semantic interpretability of the video recognition and prediction process.
In this paper, we propose a new approach that models a formal grammar in terms of learnable and differentiable neural network functions. The objective is to formulate not only the terminals and nonterminals of our grammar as learnable representations, but also the production rules generating them as differentiable functions. We provide a loss function to train our differentiable grammar directly from data, and present methodologies to take advantage of it for recognizing and forecasting sequences. Rather than focusing on nonterminals and production rules that generate or parse symbolic data (e.g., text strings), our approach allows learning of grammar representations directly on top of higher-dimensional data streams (e.g., sequences of representation vectors). We confirm this capability experimentally by learning a differentiable regular grammar from continuous representations, which can be applied to any sequential data, including outputs of 3D space-time CNNs.
The primary contributions of our work are:


The design of a fully differentiable neural network that learns the structure (terminals, nonterminals, and production rules) of a regular grammar.

The grammar model is easily interpretable, enabling understanding of the structures learned from data.

We confirm that the approach works on sequential real-world datasets, and outperforms the state-of-the-art on challenging benchmarks.

We show that the model achieves better results on forecasting human activities that are to occur subsequently in videos.
The goal of this work is to provide the research community with neural differentiable grammar-based matching and prediction for video analysis, which is also applicable to other domains. The results are interpretable, which is very important for real-life decision-making scenarios. Furthermore, the model predicts future events with higher accuracy, which is crucial for anticipating and reacting to future actions, for example for an autonomous robot interacting with humans in dynamic environments.
2 Background
A formal grammar is defined by four elements G = (N, T, P, S), where N is a finite set of nonterminals, T is a finite set of terminals, P is a finite set of production rules, and S ∈ N is the starting nonterminal.
In a regular grammar, the production rules are in the following forms:

A → aB,  A → a,  A → ε   (1)

where A and B are nonterminals in N, a is any terminal in T, and ε denotes the empty string. A regular grammar is a type-3 formal grammar in the Chomsky hierarchy.
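As a concrete (hypothetical) illustration of rules in these forms, a tiny regular grammar can be sampled by repeatedly applying productions; the symbols and rules below are invented for illustration and do not come from the paper:

```python
import random

# A toy regular grammar: each rule maps a nonterminal to
# (terminal, next nonterminal), with None marking a terminating rule.
RULES = {
    "A": [("a", "B")],
    "B": [("b", "A"), ("b", "C")],
    "C": [("c", None)],
}

def generate(start="A", seed=0, max_steps=10):
    """Generate a terminal string by repeatedly applying production rules."""
    rng = random.Random(seed)
    out, nt = [], start
    for _ in range(max_steps):
        terminal, nt = rng.choice(RULES[nt])
        out.append(terminal)
        if nt is None:  # an A -> a style rule ends the derivation
            break
    return "".join(out)
```

Each sampled string (e.g., "abab…" or "ab…c") is a member of the language defined by these rules.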
In this paper, we follow this traditional regular grammar definition, while extending it so that its terminals, nonterminals, and production rules are represented in terms of differentiable neural network functions. Our differentiable grammar can be interpreted as a particular form of recurrent neural network (RNN). The main difference from standard RNNs such as LSTMs and GRUs (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) is that our grammar explicitly maintains a set of nonterminal representations (in contrast to the single hidden representation of standard RNNs) and learns multiple distinct production rules per nonterminal. This not only makes the learned model more semantically interpretable, but also allows learning of temporal structures with multiple sequence possibilities. Our grammar, learned with a randomized production rule selection function, considers multiple transitions between abstract nonterminals both when matching against input sequences and when generating multiple possible future sequences.
We also experimentally compare our grammar with previous models, including LSTMs (Hochreiter & Schmidhuber, 1997) and Neural Turing Machines (NTMs) (Graves et al., 2014), in the experiments section.

3 Approach
3.1 Formulation
We model our formal grammar in terms of latent representations and differentiable functions mapping between such representations. The parameters of these functions define the production rules, which are learned together with the terminal and nonterminal representations.
Each nonterminal in N is a latent representation with fixed dimensionality, whose actual values are learned from the training data. Each terminal in T corresponds to a video representation obtainable at every time step, such as a vector of activity class predictions; these are learned as well. Our production rules are represented as a pair of two functions:

f: a function that maps each nonterminal in N (e.g., A) to a subset of production rules in P (i.e., the rules applicable to the current nonterminal).

g: a function that maps each rule to a terminal (e.g., a) and the next nonterminal (e.g., B):

p = f(n),  (n', t) = g(p)   (2)
The combination of the two functions effectively captures multiple production rules per nonterminal, such as "A → aB" and "A → aC". The starting nonterminal S is learned to be one of the latent representations in N. The functions f and g are learned from data.
These form a straightforward (recursive) generative model, which starts from the starting nonterminal and iteratively generates a terminal at every time step. Representing our production rules as functions allows us to model the generation of a sequence (i.e., a string) of terminals as the repeated application of such functions. At every time step i, let us denote the first function, mapping each nonterminal to a set of production rules, as p_i = f(n_{i-1}), and the second function, mapping each rule to a nonterminal/terminal pair, as (n_i, t_i) = g(p_i), where n_i is a nonterminal and t_i a terminal. p_i is a latent vector describing the production rule activations corresponding to n_{i-1}.
In its simplest form, we can make our grammar rely on only one production rule by applying the softmax function to the activation vector p_i: r_i = softmax(p_i). This formulation makes r_i a (soft) one-hot indicator vector selecting a single production rule. Our sequence generation then becomes:

(n_i, t_i) = g(softmax(f(n_{i-1})))   (3)

We represent each n_i as a |N|-dimensional soft one-hot vector, where |N| is the number of nonterminals. In the actual implementation, this is constrained by having a softmax function as part of g to produce n_i. Each t_i is a D-dimensional representation we learn to generate, where D is the dimensionality of the sequential representation at every time step. This process is shown in Fig. 2.
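This generation step can be sketched in NumPy as follows. The linear maps W_f, W_n, and W_t standing in for f and g are illustrative assumptions (randomly initialized here, learned in practice), as are the dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nt, num_rules, D = 3, 6, 4   # |N| nonterminals, total rules, terminal dim

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical (untrained) linear maps standing in for f and g.
W_f = rng.standard_normal((num_nt, num_rules))   # nonterminal -> rule logits
W_n = rng.standard_normal((num_rules, num_nt))   # rule -> next nonterminal
W_t = rng.standard_normal((num_rules, D))        # rule -> terminal

def step(n):
    """One application of the generation step above."""
    r = softmax(n @ W_f)             # soft one-hot rule selection
    return softmax(r @ W_n), r @ W_t # (next nonterminal, terminal)

n0 = np.eye(num_nt)[0]               # starting nonterminal as a one-hot vector
n1, t1 = step(n0)
```

Iterating `step` produces a soft-one-hot nonterminal and a D-dimensional terminal at every time step.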
We further extend Eq. 3 to make the grammar consider multiple production rules in a randomized fashion during learning and generation. More specifically, we use the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) to replace the softmax in Eq. 3. Treating the activation vector p_i as a distribution over production rules, the Gumbel-Softmax allows sampling of different production rules:

(n_i, t_i) = g(GumbelSoftmax(f(n_{i-1})))   (4)

In our case, this means that we learn grammar production rules that can be selected/sampled differently even for the same nonterminal, while still maintaining a differentiable process.
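A generic Gumbel-Softmax sampler (Jang et al., 2017) can be sketched as follows; this is the standard formulation, not necessarily the paper's exact implementation:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Gumbel-Softmax sampling: a differentiable, randomized relaxation
    of selecting one production rule from a logit vector."""
    rng = rng or np.random.default_rng(0)
    # Gumbel(0, 1) noise makes the argmax a sample from softmax(logits).
    gumbel = -np.log(-np.log(rng.uniform(size=np.shape(logits))))
    y = (np.asarray(logits) + gumbel) / tau
    e = np.exp(y - y.max())
    return e / e.sum()
```

Lower temperatures `tau` push the output toward a hard one-hot rule selection, while keeping gradients with respect to the logits.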
The idea behind our grammar formulation is to allow direct training of the parameters governing the generation of terminals (e.g., video representations in our case), while representing the process in terms of explicit (differentiable) production rules. This is in contrast to traditional work that attempted to extract grammars from already-trained standard RNNs (Gers & Schmidhuber, 2001), as well as more recent neural parsing works using discrete operators (Dyer et al., 2016) and memory-based RNNs (Graves et al., 2014). Our formulation also adds interpretability/explainability to the temporal models learned from data streams, as we confirm in the following subsections.
Detailed implementation of production rule functions:
Although any differentiable functions could be used to model f and g, we use matrix operations to implement them. Given a matrix of production rules A of size |N| × K|N|, where K is the maximum number of production rules per nonterminal, we obtain the activation vector p_i of size K|N| as:

p_i = n_{i-1} A   (5)

We constrain A so that each of its columns has only one nonzero element (i.e., each production rule may originate from only one nonterminal). In the actual implementation, A is obtained by modeling it as a |N| × K matrix and then inflating it with zeros into a block-diagonal matrix of size |N| × K|N| with block size 1 × K.
Similarly, the function g, mapping each production rule to the next nonterminal and corresponding terminal, is implemented using a K|N| × |N| matrix B and a K|N| × D matrix C:

n_i = r_i B,  t_i = r_i C   (6)

where r_i = GumbelSoftmax(p_i). With this implementation, learning the grammar production rules amounts to learning the matrices A, B, and C directly. Figure 4 describes an example.
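The inflation into a block-diagonal matrix described above can be sketched as follows; the (|N|, K) shape and the function name are assumptions for illustration:

```python
import numpy as np

def inflate_block_diagonal(W):
    """Inflate a (num_nt, K) per-nonterminal rule-weight matrix into a
    (num_nt, num_nt*K) block-diagonal matrix: row i carries its K rule
    weights in block i and zeros elsewhere, so every rule column
    belongs to exactly one nonterminal."""
    num_nt, K = W.shape
    A = np.zeros((num_nt, num_nt * K))
    for i in range(num_nt):
        A[i, i * K:(i + 1) * K] = W[i]
    return A
```

The resulting matrix satisfies the constraint that each column has at most one nonzero element.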
3.2 Learning
We train our grammar model to minimize the following binary cross entropy loss:

L = -Σ_i Σ_c [ y_{i,c} log t_{i,c} + (1 - y_{i,c}) log(1 - t_{i,c}) ]   (7)

where y_i is the ground truth label vector at time i with dimensionality D and t_i is the output of the grammar model (a terminal). In the case where the grammar is used to predict discrete class labels, y_i is a one-hot vector. Training of the functions f and g (i.e., the matrices A, B, and C) can be done with straightforward backpropagation for the single-production-rule case of Eq. 3, as the model becomes a deterministic function per nonterminal at each time step. Backpropagating through the entire sequential application of our functions also allows learning of the starting nonterminal representation S.

Learning multiple production rules:
In general, our function f maps a nonterminal to a 'set' of production rules in which different rules can be equally valid. This means that we must train the model by generating many sequences, taking up to k rules at each step (k is the branching factor).
We enumerate multiple production rules by randomizing the production rule selection using the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017), as described in the previous subsection. This allows for weighted random selection of the rules based on the learned rule probabilities. In order to train our grammar model with the Gumbel-Softmax, we maintain multiple different 'branches' of nonterminal selections and terminal generations, and measure the loss by considering all of them. Algo. 1 and Figure 3 illustrate the training and branching process. When generating many branches, we compute the loss for each generated sequence, then take the minimum loss over the branches, effectively choosing the branch that generated the most similar string:

L = min_b ( -Σ_i Σ_c [ y_{i,c} log t_{i,c,b} + (1 - y_{i,c}) log(1 - t_{i,c,b}) ] )   (8)

where t_{i,c,b} is the output of the grammar model (terminal) at time i for class c and branch b. Branches are pruned to keep the process computationally tractable, limiting the total number of branches we maintain.
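The min-over-branches objective can be sketched as follows, assuming the branches have already been generated; names and shapes here are illustrative:

```python
import numpy as np

def bce(y, t, eps=1e-7):
    """Binary cross entropy between a label vector and a prediction."""
    t = np.clip(t, eps, 1 - eps)
    return float(-(y * np.log(t) + (1 - y) * np.log(1 - t)).sum())

def min_branch_loss(y_seq, branches):
    """Score every maintained branch against the ground truth sequence
    and keep the best-matching one. `branches` is a list of (T, D)
    prediction arrays; `y_seq` is the (T, D) ground truth."""
    return min(sum(bce(y, t) for y, t in zip(y_seq, outs))
               for outs in branches)
```

During training, gradients would flow through the selected (minimum-loss) branch.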
3.3 Interpretability
As our model is constrained to use a finite set of nonterminals, terminals, and production rules, the learned grammar structure is easy to interpret. We can conceptually convert the learned production rule matrices A, B, and C into a discrete set of symbolic production rules by associating symbols with the learned terminal (and nonterminal) representations. The matrix A encodes the left-hand-side nonterminal of each production rule (e.g., A in A → aB), the matrix C encodes the rule's terminal (e.g., a), and the matrix B corresponds to the right-hand-side nonterminal (e.g., B). The element values of A in particular suggest the probability associated with each production rule (i.e., they govern the probability of the rule being randomly selected by the Gumbel-Softmax). Fig. 4 shows how we construct a grammar from the learned matrices.
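This conversion can be sketched as follows; the matrix naming and shapes are assumptions consistent with the implementation sketch above, not the paper's exact code:

```python
import numpy as np

def extract_rules(A, B, C, nonterminals, terminals):
    """Read symbolic production rules off learned matrices: for rule
    column r, the nonzero row of A gives the left-hand-side nonterminal
    and the rule's selection weight, C's argmax names the terminal, and
    B's argmax names the right-hand-side nonterminal."""
    rules = []
    for r in range(A.shape[1]):
        lhs = int(np.argmax(A[:, r]))
        prob = float(A[lhs, r])
        if prob == 0:
            continue  # unused rule slot
        term = terminals[int(np.argmax(C[r]))]
        rhs = nonterminals[int(np.argmax(B[r]))]
        rules.append(f"{nonterminals[lhs]} -> {term}{rhs} (p={prob:.2f})")
    return rules
```

For hand-constructed one-hot matrices, this recovers rules such as "A -> aB" with their associated weights.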
Fig. 8 illustrates examples of such an interpreted grammar, learned from a raw baseball video dataset. This was done by associating symbols with the learned terminals and nonterminals.
3.4 Application to video datasets
While applying our differentiable grammar learning to 1-D data is rather straightforward, applying it to more complex continuous data with varied content, such as videos, requires certain extensions.
To apply the grammar model to videos, we make a few key changes. The initial nonterminal is learned from the video representation: we learn a function h that maps the video representation to the initial nonterminal, n_0 = h(v), where v is the output of a video CNN (e.g., I3D (Carreira & Zisserman, 2017)). We then train the grammar model as above, where the ground truth is the sequence of one-hot vectors based on the activity labels in the video.
During inference (i.e., predicting frame-level activity labels), we generate a sequence by selecting the rule that best matches the CNN-predicted classes. We then multiply the predictions from the grammar with the predictions from the CNN. To predict future, yet-unseen actions, we generate a sequence following the most likely production rules.
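The multiplication step can be sketched as a simple per-frame fusion; the renormalization here is an assumption for illustration:

```python
import numpy as np

def fuse(cnn_probs, grammar_probs):
    """Multiply the CNN's class probabilities with the grammar's
    generated terminal for the same frame, then renormalize."""
    p = np.asarray(cnn_probs) * np.asarray(grammar_probs)
    return p / p.sum(axis=-1, keepdims=True)
```

When the grammar is uniform (uninformative), the CNN predictions pass through unchanged; otherwise the grammar sharpens classes consistent with the learned temporal structure.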
4 Experiments
4.1 Toy Examples
We first confirm that our model is able to learn the rules of simple, handcrafted grammars and show how we can easily interpret the learned model. Given the simple grammar:
We train a model with 3 terminal symbols (a, b, and c), 3 nonterminal symbols (A, B, and C), and 2 production rules per nonterminal. We can then examine the learned grammar structure, shown in Fig. 4. We observe that the learned starting nonterminal corresponds to 'A', and following the learned rules yields 'aB'. From nonterminal 'B', the learned rules go to 'bA' or 'bC' with 50% probability each. From nonterminal 'C', the learned rules go to 'cA'. This confirms that the model learns grammar rules and is easy to interpret.
4.2 Air Pollution Time-Series Dataset
We further test the algorithm on a time-series dataset to demonstrate its use in the time-series domain.
The Air Pollution prediction dataset (Liang et al., 2016) is intended for predicting urban pollution levels. The data provides hourly measurements, 24 hours a day, spanning several years of sensing. The dataset contains several environmental factors as input features and measures overall air pollution ('PM2.5 concentration'). It contains about 43,000 examples, which are split consecutively into train and test sets with a 50:50 ratio.
Figure 5 visualizes the prediction results of our model. As seen, it correctly approximates the actual values (a portion of the training data is also shown). We evaluate the model by measuring root mean squared error (RMSE). Simply predicting the last seen value for the remaining data gives an RMSE of . Using our grammar model, we get an RMSE of . An LSTM-based model performs at .
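The "predict the last seen value" baseline mentioned above amounts to the following RMSE computation (a sketch; loading of the actual dataset is omitted):

```python
import numpy as np

def persistence_rmse(series):
    """RMSE of the persistence baseline: the prediction at time t is
    simply the observed value at time t-1."""
    s = np.asarray(series, dtype=float)
    return float(np.sqrt(np.mean((s[1:] - s[:-1]) ** 2)))
```

This gives a simple floor that any learned sequence model should beat on slowly varying signals.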
4.3 Activity Detection Experiments
We further confirm that our method works on three real-world, challenging activity detection datasets: MLB-YouTube (Piergiovanni & Ryoo, 2018a), Charades (Sigurdsson et al., 2016b), and MultiTHUMOS (Yeung et al., 2015). All datasets are evaluated by per-frame mAP. The datasets are described as follows:
MultiTHUMOS: The MultiTHUMOS dataset (Yeung et al., 2015) is a large-scale video analysis dataset with frame-level annotations for activity recognition. It is a challenging dataset that supports dense multi-class annotations (i.e., per frame), which are used here for both prediction and ground truth. It contains 400 videos, or about 30 hours of video, and 65 action classes. Examples are shown in Fig. 7.
Charades: The Charades dataset (Sigurdsson et al., 2016b) is a challenging dataset with unstructured activities in videos. The videos are everyday activities in a home environment. It contains 9858 videos and spans 157 classes. Examples are shown in Fig. 6.
MLB-YouTube: The MLB-YouTube dataset (Piergiovanni & Ryoo, 2018a) is a challenging video activity recognition dataset collected from live TV broadcasts of baseball games (with many challenges, such as the small resolution of the activities in question). It further poses a fine-grained recognition challenge, as all potential activities occur in the same context and environment, unlike many other datasets whose more diverse activities can be recognized partly from context. It has 4,290 videos spanning 42 hours. Additionally, baseball games follow a rigid structure, making this dataset ideal for evaluating the learned grammar. Example frames are shown in Fig. 1.
Implementation Details
4.4 Results on MLB-YouTube
Table 1 shows the results of the proposed algorithm on the MLB-YouTube dataset, compared to state-of-the-art algorithms including RNNs such as LSTMs and Neural Turing Machines (NTMs). We evaluated the methods in two settings: (1) learning the grammar on top of features from I3D, and (2) on top of the recently proposed super-events method. The results clearly show that our differentiable grammar learning better captures temporal/sequential information in videos. We also compare to LSTMs and NTMs using both CNN features (e.g., I3D) as input and using the predicted class probabilities as input, as the latter is more comparable to our grammar model. We find that the use of class probabilities slightly degrades performance for LSTMs and NTMs.
Model  mAP 

Random  13.4 
I3D  34.2 
I3D + LSTM  39.4 
I3D + NTM (Graves et al., 2014)  36.8 
I3D class prob + LSTM  37.4 
I3D class prob + NTM (Graves et al., 2014)  36.8 
I3D with Grammar (ours)  43.4 
I3D + super-events (Piergiovanni & Ryoo, 2018b)  39.1 
I3D + super-events with Grammar (ours)  44.2 
4.5 Results on MultiTHUMOS
Table 2 shows results comparing two common methods with and without the proposed grammar. We also test both settings as above and compare to the state-of-the-art. In both settings, the use of the learned grammar outperforms previously known methods.
Method  mAP 

Twostream (Yeung et al., 2015)  27.6 
Twostream + LSTM (Yeung et al., 2015)  28.1 
MultiLSTM (Yeung et al., 2015)  29.6 
Predictivecorrective (Dave et al., 2017)  29.7 
I3D baseline  29.7 
I3D + LSTM  29.9 
I3D + NTM (Graves et al., 2014)  29.8 
I3D class prob + LSTM  29.8 
I3D class prob + NTM (Graves et al., 2014)  29.7 
I3D with Grammar (ours)  32.3 
I3D + super-events (Piergiovanni & Ryoo, 2018b)  36.4 
I3D + super-events with Grammar (ours)  37.7 
4.6 Results on Charades
Table 3 compares the proposed grammar to prior techniques on the Charades dataset (localization_v1 setting). This dataset is quite challenging: until recently, detection accuracy on it was below 10 percent mAP. Our results again outperform the state-of-the-art, increasing accuracy on this dataset to over 20 percent mAP. We note consistent improvements in both settings, similar to the results on MultiTHUMOS and MLB-YouTube. In particular, differentiable grammar learning outperforms previous RNNs, including LSTMs and NTMs.
Method  mAP 

Predictivecorrective (Dave et al., 2017)  8.9 
Twostream (Sigurdsson et al., 2016a)  8.94 
Twostream+LSTM (Sigurdsson et al., 2016a)  9.6 
RC3D (Xu et al., 2017)  12.7 
Sigurdsson et al. (Sigurdsson et al., 2016a)  12.8 
I3D baseline  17.2 
I3D + LSTM  18.1 
I3D + NTM (Graves et al., 2014)  17.5 
I3D class prob + LSTM  17.6 
I3D class prob + NTM (Graves et al., 2014)  17.4 
I3D with Grammar (ours)  18.5 
I3D + super-events (Piergiovanni & Ryoo, 2018b)  19.4 
I3D + super-events with Grammar (ours)  20.3 
4.7 Future Prediction/Forecasting
As our grammar model is generative, we can apply it to predict future, unseen activities. Future prediction is important, especially for autonomous systems (e.g., robots), as they need to anticipate potential future activities to respond to them. Once the grammar is learned, future sequences containing unseen activities are generated by selecting the most probable production rule at every (future) time step.
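Greedy rule-following forecasting can be sketched as below; the transition table stands in for the learned production-rule probabilities, and its size and values are invented for illustration:

```python
import numpy as np

# Hypothetical learned rule probabilities between four activity
# terminals (rows: current, cols: next); illustrative values only.
TRANS = np.array([
    [0.0, 0.7, 0.2, 0.1],
    [0.6, 0.0, 0.4, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])

def forecast(current, horizon):
    """At every future step, follow the most probable rule from the
    current state, producing a sequence of predicted activity indices."""
    seq = []
    for _ in range(horizon):
        current = int(np.argmax(TRANS[current]))
        seq.append(current)
    return seq
```

Sampling from the rows instead of taking the argmax would instead generate multiple plausible futures.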
For this experiment we consider prediction at short-term horizons (the next 2 seconds), mid-term horizons (the next 10 seconds), and longer-term horizons (the next 20 seconds). We compare to baselines such as random guessing, repeatedly predicting the last seen frame, and an LSTM (using I3D features), which is commonly used for future frame forecasting. We evaluate these methods using per-frame mAP.
Table 4 shows the results of future prediction on the MultiTHUMOS dataset. We confirm that the proposed method is more accurate at all future horizons considered. We note that 10-20 seconds into the future is a very challenging prediction setting, especially for multi-label datasets.
Table 5 shows the results of future prediction on the Charades dataset. Here too, the proposed grammar approach is more accurate at future frame prediction; its predictions 10 seconds into the future outperform the state-of-the-art predictions for only 2 seconds ahead. This dataset is more challenging in itself, which makes future prediction even harder.
Method  2 sec  10 sec  20 sec 

Random  2.6  2.6  2.6 
Last frame  6.2  5.8  2.8 
I3D + LSTM  8.5  6.6  2.9 
I3D + Grammar (ours)  10.4  8.3  3.5 
Method  2 sec  10 sec  20 sec 

Random  2.4  2.4  2.4 
Last frame  6.8  3.3  2.4 
I3D + LSTM  6.5  4.6  2.5 
I3D + Grammar (ours)  8.6  7.3  5.5 
4.8 Visualization of Learned Grammars
In Figure 4, we illustrate how we convert the learned matrices into the grammar's production rules. From the training data, we know the mapping from terminal symbols to labels. We can then examine the rule matrix and the nonterminals to construct the rules.
We also visualize the learned grammar for the MLB-YouTube dataset, where, interestingly, typical baseball sequences are learned. Figure 8 gives a conceptual visualization of the learned regular grammar. In Figure 9, we illustrate the actual learned matrices corresponding to one of the production rules, and in Figure 10, how all the learned rules are inferred from the learned matrices.
In Figure 8, we illustrate the learned grammar. We see that the learned grammar matches the structure of a baseball game, and the learned probabilities are similar to the observed data, confirming that our model learns the correct rule structure. For example, an activity starts with a pitch, which can be followed by a swing, bunt, or hit. After a hit, foul, or strike, another pitch follows. The learned grammar is illustrated with the probability of each rule in parentheses.
5 Related work
Chomsky grammars (Chomsky, 1956, 1959) are designed to represent functional linguistic relationships. They have found wide applications in defining programming languages, natural language understanding, and understanding of images and videos (Socher et al., 2011).
There are early works exploring the extraction of grammars/state machines from trained RNNs (Kolen, 1994; Bodén & Wiles, 2000; Tiňo et al., 1998). Other works have attempted to learn neural pushdown automata for context-free grammars (Sun et al., 2017) or neural Turing machines (Graves et al., 2014). However, these works explored only simple toy experiments and were never tested on real-world data.
Some works have explored learning more explicit structures by forcing states to be discrete and using pseudo-gradients to learn grammatical structures (Zeng et al., 1994). However, they still rely on a standard RNN to model the sequences. It has also been found that LSTMs/RNNs can learn grammars (Gers & Schmidhuber, 2001; Giles et al., 1995; Das et al., 1992). Different from all these works, we design a neural network architecture that explicitly models the structure of a grammar, which leads to much easier interpretability.
Other works have explored using neural networks to learn a parser. Socher et al. (2011) parsed scenes by learning to merge representations. Mayberry & Miikkulainen (1999) learned a shift-reduce neural network parser, and Chen & Manning (2014) learned a dependency parser as a neural network. While these works learn some grammar structure, it is difficult to interpret what they are learning.
Within activity recognition, regular and context-free grammars have been used to parse and understand videos (Moore & Essa, 2002; Pirsiavash & Ramanan, 2014; Ivanov & Bobick, 2000; Ryoo & Aggarwal, 2009; Si et al., 2011). Other works have extended CFGs with attribute grammars (Joo & Chellappa, 2006) or used context-sensitive constraints and interval logic (Brendel et al., 2011; Kwak et al., 2014).
6 Conclusion
In conclusion, we presented a differentiable model for learning grammars for the purpose of parsing videos or other streaming data. The learned structures are interpretable, which is important for understanding the behavior of the model and the decisions it makes. The proposed method outperforms prior state-of-the-art techniques on several challenging benchmarks. Furthermore, it predicts future events with higher accuracy, which is necessary for anticipating and reacting to future actions.
In the future, we plan to apply it to data streams with even longer horizons. Further, we aim to apply our differentiable grammar learning to higher-dimensional representations, learning them jointly with image/video CNNs in an end-to-end fashion.
References
 Bodén & Wiles (2000) Bodén, M. and Wiles, J. Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science, 12(3-4):197–210, 2000.

 Brendel et al. (2011) Brendel, W., Fern, A., and Todorovic, S. Probabilistic event logic for interval-based event recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
 Carreira & Zisserman (2017) Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

 Chen & Manning (2014) Chen, D. and Manning, C. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 740–750, 2014.
 Cho et al. (2014) Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
 Chomsky (1956) Chomsky, N. Three models for the description of language. In IRE Transactions on Information Theory (2), pp. 113–124, 1956.
 Chomsky (1959) Chomsky, N. On certain formal properties of grammars. In Information and Control. 2 (2), pp. 137–167, 1959.
 Das et al. (1992) Das, S., Giles, C. L., and Sun, G.-Z. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pp. 14, 1992.
 Dave et al. (2017) Dave, A., Russakovsky, O., and Ramanan, D. Predictivecorrective networks for action detection. arXiv preprint arXiv:1704.03615, 2017.
 Dyer et al. (2016) Dyer, C., Kuncoro, A., Ballesteros, M., and Smith, N. A. Recurrent neural network grammars. In NAACLHLT, 2016.
 Gers & Schmidhuber (2001) Gers, F. A. and Schmidhuber, J. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.
 Giles et al. (1995) Giles, C. L., Horne, B. G., and Lin, T. Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8(9):1359–1365, 1995.
 Graves et al. (2014) Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long shortterm memory. Neural Computation, 9:1735–80, 12 1997. doi: 10.1162/neco.1997.9.8.1735.
 Ivanov & Bobick (2000) Ivanov, Y. A. and Bobick, A. F. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872, 2000.
 Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR), 2017.

 Joo & Chellappa (2006) Joo, S.-W. and Chellappa, R. Attribute grammar-based event recognition and anomaly detection. In Computer Vision and Pattern Recognition Workshop. IEEE, 2006.
 Kolen (1994) Kolen, J. F. Fool's gold: Extracting finite state machines from recurrent network dynamics. In Advances in Neural Information Processing Systems (NIPS), 1994.
 Kwak et al. (2014) Kwak, S., Han, B., and Han, J. H. Online video event detection by constraint flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1174–1186, 2014.
 Liang et al. (2016) Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., and Chen, S. X. Assessing beijing’s pm2.5 pollution: severity, weather impact, apec and winter heating. In Proceedings of the Royal Society A, 471, 20150257, 2016.

 Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
 Mayberry & Miikkulainen (1999) Mayberry, M. R. and Miikkulainen, R. SARDSRN: A neural network shift-reduce parser. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), 1999.
 Moore & Essa (2002) Moore, D. and Essa, I. Recognizing multitasked activities from video using stochastic context-free grammar. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2002.
 Piergiovanni & Ryoo (2018a) Piergiovanni, A. and Ryoo, M. S. Fine-grained activity recognition in baseball videos. In CVPR Workshop on Computer Vision in Sports, 2018a.
 Piergiovanni & Ryoo (2018b) Piergiovanni, A. and Ryoo, M. S. Learning latent superevents to detect multiple activities in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.
 Pirsiavash & Ramanan (2014) Pirsiavash, H. and Ramanan, D. Parsing videos of actions with segmental grammars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 Ryoo & Aggarwal (2009) Ryoo, M. S. and Aggarwal, J. K. Semantic representation and recognition of continued and recursive human activities. International Journal of Computer Vision (IJCV), 82(1):1–24, 2009.
 Si et al. (2011) Si, Z., Pei, M., Yao, B., and Zhu, S.C. Unsupervised learning of event andor grammar and semantics from video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
 Sigurdsson et al. (2016a) Sigurdsson, G. A., Divvala, S., Farhadi, A., and Gupta, A. Asynchronous temporal fields for action recognition. arXiv preprint arXiv:1612.06371, 2016a.
 Sigurdsson et al. (2016b) Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., and Gupta, A. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of European Conference on Computer Vision (ECCV), 2016b.
 Socher et al. (2011) Socher, R., Lin, C. C., Manning, C., and Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Advances in Neural Information Processing Systems (NIPS), 2011.
 Sun et al. (2017) Sun, G.Z., Giles, C. L., Chen, H.H., and Lee, Y.C. The neural network pushdown automaton: Model, stack and learning simulations. arXiv preprint arXiv:1711.05738, 2017.
 Tiňo et al. (1998) Tiňo, P., Horne, B. G., Giles, C. L., and Collingwood, P. C. Finite state machines and recurrent neural networks—automata and dynamical systems approaches. In Neural networks and pattern recognition, pp. 171–219. Elsevier, 1998.
 Xu et al. (2017) Xu, H., Das, A., and Saenko, K. Rc3d: Region convolutional 3d network for temporal activity detection. arXiv preprint arXiv:1703.07814, 2017.

 Yeung et al. (2015) Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., and Fei-Fei, L. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision (IJCV), pp. 1–15, 2015.
 Zeng et al. (1994) Zeng, Z., Goodman, R. M., and Smyth, P. Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2):320–330, 1994.