Developing novel neural network architectures is at the core of many recent AI advances (Szegedy et al., 2015; He et al., 2016; Zilly et al., 2016). The process of architecture search and engineering is slow, costly, and laborious. Human experts, guided by intuition, explore an extensive space of potential architectures where even minor modifications can produce unexpected results. Ideally, an automated architecture search algorithm would find the optimal model architecture for a given task.
We propose a meta-learning strategy for flexible automated architecture search of recurrent neural networks (RNNs) which explicitly includes novel operators in the search. It consists of three stages, outlined in Figure 1, for which we instantiate two versions.
A candidate architecture generation function produces potential RNN architectures using a highly flexible DSL. The DSL enforces no constraints on the size or complexity of the generated tree and can be incrementally constructed using either a random policy or with an RL agent.
A ranking function processes each candidate architecture’s DSL via a recursive neural network, predicting the architecture’s performance. By unrolling the RNN representation, the ranking function can also model the interactions of a candidate architecture’s hidden state over time.
An evaluator, which takes the most promising candidate architectures, compiles their DSLs to executable code and trains each model on a specified task. The results of these evaluations form architecture-performance pairs that are then used to train the ranking function and RL generator.
2 A Domain Specific Language for Defining RNNs
In this section, we describe a domain specific language (DSL) used to define recurrent neural network architectures. This DSL sets out the search space that our candidate generator can traverse during architecture search. In comparison to Zoph and Le (2017), which only produced a binary tree with matrix multiplications at the leaves, our DSL allows a broader modeling search space to be explored.
When defining the search space, we want to allow for standard RNN architectures such as the Gated Recurrent Unit (GRU) (Cho et al., 2014) or Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) to be defined in both a human and machine readable manner.
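The core operators for the DSL are 4 unary operators, 2 binary operators, and a single ternary operator:
[MM, Sigmoid, Tanh, ReLU] [Mult, Add] [Gate3]
MM represents a single linear layer with bias, i.e. $\text{MM}(x) := Wx + b$. Similarly, we define $\text{Sigmoid}(x) = \sigma(x)$. The operator Mult represents element-wise multiplication: $\text{Mult}(x, y) = x \circ y$. The Gate3 operator performs a weighted summation between two inputs, defined by $\text{Gate3}(x, y, f) = \sigma(f) \circ x + (1 - \sigma(f)) \circ y$. These operators are applied to source nodes from the set $[x_t, x_{t-1}, h_{t-1}, c_{t-1}]$, where $x_t$ and $x_{t-1}$ are the input vectors for the current and previous timestep, $h_{t-1}$ is the output of the RNN for the previous timestep, and $c_{t-1}$ is optional long term memory.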
The Gate3 operator is required as some architectures, such as the GRU, re-use the output of a single Sigmoid for the purposes of gating. While allowing all possible node re-use is out of scope for this DSL, the Gate3 ternary operator allows for this frequent use case.
Using this DSL, standard RNN cell architectures such as the $\tanh$ RNN can be defined: $\text{Tanh}(\text{Add}(\text{MM}(x_t), \text{MM}(h_{t-1})))$. To illustrate a more complex example that includes Gate3, the GRU is defined in full in Appendix A.
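To make the operator semantics concrete, the following is a minimal sketch of how a DSL expression could be represented and evaluated. It is not the implementation used in this work: the nested-tuple encoding, the scalar weights standing in for weight matrices, the source names such as `h_tm1`, and the form of the ternary gate (a sigmoid applied to its third argument) are all assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def evaluate(node, sources):
    """Evaluate a DSL tree; leaves are source-node names, inner nodes are tuples."""
    if isinstance(node, str):
        return sources[node]
    op = node[0]
    if op == "MM":
        # ("MM", (weight, bias), child): scalars stand in for W and b here.
        (w, b), child = node[1], node[2]
        return w * evaluate(child, sources) + b
    args = [evaluate(child, sources) for child in node[1:]]
    if op == "Sigmoid":
        return sigmoid(args[0])
    if op == "Tanh":
        return math.tanh(args[0])
    if op == "ReLU":
        return max(0.0, args[0])
    if op == "Mult":
        return args[0] * args[1]
    if op == "Add":
        return args[0] + args[1]
    if op == "Gate3":
        x, y, f = args
        gate = sigmoid(f)  # assumed form of the gate, see the lead-in
        return gate * x + (1.0 - gate) * y
    raise ValueError(f"unknown operator: {op}")


# The tanh RNN written as a tree, with hypothetical scalar weights.
tanh_rnn = ("Tanh", ("Add", ("MM", (0.5, 0.0), "x_t"), ("MM", (0.25, 0.1), "h_tm1")))
h_t = evaluate(tanh_rnn, {"x_t": 1.0, "h_tm1": 2.0})
```

A real cell would replace the scalars with learned matrices and vectors, but the recursive structure of the evaluation would stay the same.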
2.1 Adding support for architectures with long term memory
With the operators defined above it is not possible to refer to and re-use an arbitrary node. The best performing RNN architectures however generally use not only a hidden state $h_t$ but also an additional hidden state $c_t$ for long term memory. The value of $c_t$ is extracted from an internal node computed while producing $h_t$.
The DSL above can be extended to support the use of $c_t$ by numbering the nodes and then specifying which node to extract $c_t$ from (i.e. ). We append the node number to the end of the DSL definition after a delimiter.
As an example, the nodes in bold are used to produce $c_t$, with the number appended at the end indicating the node’s number. Nodes are numbered top to bottom ($h_t$ will be largest), left to right.
2.2 Expressibility of the domain specific language
While the domain specific language is not entirely generic, it is flexible enough to capture most standard RNN architectures. This includes but is not limited to the GRU, LSTM, Minimal Gate Unit (MGU) (Zhou et al., 2016), Quasi-Recurrent Neural Network (QRNN) (Bradbury et al., 2017), Neural Architecture Search Cell (NASCell) (Zoph and Le, 2017), and simple RNNs.
2.3 Extending the Domain Specific Language
While many standard and non-standard RNN architectures can be defined using the core DSL, the promise of automated architecture search is in designing radically novel architectures. Such architectures should be formed not just by removing human bias from the search process but by including operators that have not been sufficiently explored. For our expanded DSL, we include:
[Sub, Div] [Sin, Cos] [PosEnc] [LayerNorm, SeLU]
These extensions add inverses of currently used operators ($\text{Sub}(a, b) = a - b$ instead of addition, $\text{Div}(a, b) = a / b$ instead of multiplication), trigonometric curves (Sin and Cos are sine and cosine activations respectively, PosEnc introduces a variable that is the result of applying positional encoding (Vaswani et al., 2017) according to the current timestep), and optimizations (LayerNorm applies layer normalization (Ba et al., 2016) to the input while SeLU is the activation function defined in Klambauer et al. (2017)).
2.4 Compiling a DSL definition to executable code
For a given architecture definition, we can compile the DSL to code by traversing the tree from the source nodes towards the final node $h_t$. We produce two sets of source code: one for initialization required by a node, such as defining a set of weights for matrix multiplication, and one for the forward call during runtime. For details regarding speed optimizations, refer to Appendix A2.
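The two sets of source code can be illustrated with a small sketch. It assumes a nested-tuple encoding of the tree and emits text lines rather than real framework code; the `Linear(H, H)` constructor and the lowercase function names in the emitted strings are hypothetical placeholders, not the output of the actual compiler.

```python
def compile_dsl(tree):
    """Return (init_lines, forward_lines) for a DSL tree given as nested tuples."""
    init_lines, forward_lines = [], []
    counters = {"mm": 0, "node": 0}

    def visit(node):
        if isinstance(node, str):
            return node  # source nodes are assumed to be variables already in scope
        op, *children = node
        # Children are compiled first, so code is emitted from the sources upwards.
        args = [visit(child) for child in children]
        name = f"n{counters['node']}"
        counters["node"] += 1
        if op == "MM":
            layer = f"self.mm{counters['mm']}"
            counters["mm"] += 1
            init_lines.append(f"{layer} = Linear(H, H)")  # hypothetical constructor
            forward_lines.append(f"{name} = {layer}({args[0]})")
        else:
            forward_lines.append(f"{name} = {op.lower()}({', '.join(args)})")
        return name

    output = visit(tree)
    forward_lines.append(f"h_t = {output}")
    return init_lines, forward_lines


init_code, forward_code = compile_dsl(
    ("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_tm1")))
)
```

Only nodes that own parameters contribute to the initialization code, while every operator contributes one line to the forward code.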
3 Candidate Architecture Generation
The candidate architecture generator is responsible for producing candidate architectures that are then later filtered and evaluated. Architectures are grown beginning at the output and ordered to prevent multiple representations for equivalent architectures:
Growing architectures from the output node up
Beginning from the output node $h_t$, operators are selected to be added to the computation graph, depicted in Figure 2. Whenever an operator has one or more children to be filled, the children are filled in order from left to right. If we wish to place a limit on the height (distance from $h_t$) of the tree, we can force the next child to be one of the source nodes when it would otherwise exceed the maximum height.
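The growth procedure can be sketched with a random policy as follows. The operator arities match the core DSL described earlier, while the tuple encoding, the source names, and the uniform choice among options are assumptions of this sketch.

```python
import random

ARITY = {"MM": 1, "Sigmoid": 1, "Tanh": 1, "ReLU": 1, "Mult": 2, "Add": 2, "Gate3": 3}
SOURCES = ["x_t", "x_tm1", "h_tm1", "c_tm1"]


def grow(rng, max_height, depth=0):
    """Grow a random tree from the output node downwards."""
    if depth >= max_height:
        # At the height limit the child is forced to be a source node.
        return rng.choice(SOURCES)
    # The output node itself is always an operator in this sketch.
    options = list(ARITY) if depth == 0 else list(ARITY) + SOURCES
    choice = rng.choice(options)
    if choice in ARITY:
        # Children are filled in order from left to right.
        children = [grow(rng, max_height, depth + 1) for _ in range(ARITY[choice])]
        return (choice, *children)
    return choice


def height(node):
    """Distance from this node to its deepest source node."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1:])


tree = grow(random.Random(0), max_height=4)
```

An RL agent would replace the uniform `rng.choice` call with a learned policy while keeping the same left-to-right filling order.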
Preventing duplicates through canonical architecture ordering
Due to the flexibility allowed by the DSL, there exist many DSL specifications that result in the same RNN cell. To solve the issue of commutative operators (i.e. $\text{Add}(a, b) = \text{Add}(b, a)$), we define a canonical ordering of an architecture by sorting the arguments of any commutative nodes. Special consideration is required for non-commutative operators such as Gate3, Sub, or Div. For full details, refer to Appendix A3.
3.1 Incremental Architecture Construction using Reinforcement Learning
Architectures in the DSL are constructed incrementally a node at a time starting from the output $h_t$. The simplest agent is a random one which selects the next node from the set of operators without internalizing any knowledge about the architecture or optima in the search space. Allowing an intelligent agent to construct architectures would be preferable as the agent can learn to focus on promising directions in the space of possible architectures.
For an agent to make intelligent decisions regarding which node to select next, it must have a representation of the current state of the architecture and a working memory to direct its actions. We propose achieving this with two components:
a tree encoder that represents the current state of the (partial) architecture.
an RNN which is fed the current tree state and samples the next node.
The tree encoder is an LSTM applied recursively to a node token and all its children with weights shared, but the state reset between nodes. The RNN is applied on top of the encoded partial architecture and predicts action scores for each operation. We sample with a multinomial and encourage exploration with an epsilon-greedy strategy. Both components of the model are trained jointly using the REINFORCE algorithm (Williams, 1992).
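The sampling step and the REINFORCE objective can be sketched as below. This assumes the action scores are unnormalized logits turned into probabilities with a softmax, that exploration picks uniformly at random with probability `epsilon`, and that the loss is the negative reward times the summed log-probabilities of the chosen actions; none of these details are specified in the text above.

```python
import math
import random


def sample_action(scores, epsilon, rng):
    """Return (action index, log-probability of that action under the policy)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    if rng.random() < epsilon:
        index = rng.randrange(len(scores))  # explore: uniform random action
    else:
        index = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    return index, math.log(probs[index])


def reinforce_loss(log_probs, reward):
    """REINFORCE surrogate loss for one architecture built from several actions."""
    return -reward * sum(log_probs)


rng = random.Random(0)
action, log_prob = sample_action([0.2, 1.5, -0.3], epsilon=0.1, rng=rng)
loss = reinforce_loss([log_prob], reward=1.0)
```

In a full implementation the log-probabilities would be differentiable outputs of the tree encoder and RNN, so that minimizing this loss updates both components jointly.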
As a partial architecture may contain two or more empty nodes, such as $h_t = \text{Gate3}(\emptyset, \emptyset, \sigma(\emptyset))$, we introduce a target token, $T$, which indicates which node is to be selected next. Thus, in $h_t = \text{Gate3}(T, \emptyset, \sigma(\emptyset))$, the tree encoder understands that the first argument is the slot to be filled.
3.2 Filtering Candidate Architectures using a Ranking Function
Even with an intelligent generator, understanding the likely performance of an architecture is difficult, especially the interaction of hidden states such as $h_{t-1}$ and $c_{t-1}$ between timesteps. We propose to approximate the full training of a candidate architecture by training a ranking network through regression on architecture-performance pairs. This ranking function can be specifically constructed to allow a richer representation of the transitions between $c_{t-1}$ and $c_t$.
As the ranking function uses architecture-performance samples as training data, human experts can also inject previous best known architectures into the training dataset. This is not possible for on-policy reinforcement learning and when done using off-policy reinforcement learning additional care and complexity are required for it to be effective (Harutyunyan et al., 2016; Munos et al., 2016).
Given an architecture-performance pair, the ranking function constructs a recursive neural network that reflects the nodes in a candidate RNN architecture one-to-one. Source nodes are represented by a learned vector and operators are represented by a learned function. The final vector output then passes through a linear activation and attempts to minimize the difference between the predicted and real performance. The source nodes ($x_t$, $x_{t-1}$, $h_{t-1}$, and $c_{t-1}$) are represented by learned vector representations. For the operators in the tree, we use TreeLSTM nodes (Tai et al., 2015). All operators other than Gate3 are commutative and hence can be represented by Child-Sum TreeLSTM nodes. The Gate3 operators are represented using an $N$-ary TreeLSTM which allows for ordered children.
Unrolling the graph for accurately representing $h_{t-1}$ and $c_{t-1}$:
A strong assumption made above is that the vector representation of the source nodes can accurately represent the contents of the source nodes across a variety of architectures. This may hold true for $x_t$ and $x_{t-1}$ but is not true for $h_{t-1}$ or $c_{t-1}$. The values of $h_t$ and $c_t$ are defined by the operations within the given architecture itself.
To remedy this assumption, we can unroll the architecture for a single timestep, replacing $h_{t-1}$ and $c_{t-1}$ with their relevant graph and subgraph. This would allow the representation of $h_{t-1}$ to understand which source nodes it had access to and which operations were applied to produce $h_{t-1}$.
While unrolling is useful for improving the representation of $h_{t-1}$, it is essential for allowing an accurate representation of $c_{t-1}$. This is because many small variations of $c_{t-1}$ are possible, such as selecting a subgraph before or after an activation, that may result in substantially different architecture performance.
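Unrolling for a single timestep amounts to a tree substitution, sketched below for the hidden state only. The tuple encoding and the source names (`h_tm1` for the previous hidden state, `x_tm2` and `h_tm2` for values two steps back) are assumptions of this sketch, and the long term memory node is deliberately left out because it additionally requires knowing which subgraph produces it.

```python
# How each source node is renamed when the whole cell is moved one step back in time.
SHIFT = {"x_t": "x_tm1", "x_tm1": "x_tm2", "h_tm1": "h_tm2"}


def shift_back(node):
    """Copy of the tree as it was evaluated at the previous timestep."""
    if isinstance(node, str):
        return SHIFT[node]
    return (node[0], *[shift_back(child) for child in node[1:]])


def unroll_once(tree):
    """Replace every previous-hidden-state leaf with the graph that produced it."""
    previous_step = shift_back(tree)

    def substitute(node):
        if node == "h_tm1":
            return previous_step
        if isinstance(node, str):
            return node
        return (node[0], *[substitute(child) for child in node[1:]])

    return substitute(tree)


cell = ("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_tm1")))
unrolled = unroll_once(cell)
```

After unrolling, the ranking network sees the operations that produced the previous hidden state instead of a single learned vector for it.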
4 Experiments
We evaluated our architecture generation on two experiments: language modeling (LM) and machine translation (MT). Due to the computational requirements of the experiments, we limited each experiment to one combination of generator components. For language modeling, we explore the core DSL using randomly constructed architectures (random search) directed by a learned ranking function. For machine translation, we use the extended DSL and construct candidate architectures incrementally using the RL generator without a ranking function.
4.1 Language Modeling using Random Search with a Ranking Function
For evaluating architectures found during architecture search, we use the WikiText-2 dataset (Merity et al., 2017b). When evaluating a proposed novel RNN cell, we construct a two-layer RNN from that cell with a 200 unit hidden size. Aggressive gradient clipping is performed to ensure that architectures such as the ReLU RNN would be able to train without exploding gradients. The weights of the ranking network were trained by regression on architecture-perplexity pairs using the Adam optimizer and mean squared error (MSE). Further hyperparameters and training details are listed in Appendix B1.
Explicit restrictions on generated architectures
During the candidate generation phase, we filter the generated architectures based upon specific restrictions. These include structural restrictions and restrictions aimed at effectively reducing the search space by removing likely invalid architectures. For Gate3 operations, we force the input to the forget gate to be the result of a sigmoid activation. We also require the cell to use the current timestep $x_t$ and the previous timestep’s output $h_{t-1}$ to satisfy the requirements of an RNN. Candidate architectures were limited to nodes, the same number of nodes as used in a GRU, and the maximum allowed distance (height) from $h_t$ was steps.
We also prevent the stacking of two identical operations. While this may be an aggressive filter it successfully removes many problematic architectures. These problematic architectures include when two sigmoid activations, two ReLU activations, or two matrix multiplications are used in succession – the first of which is unlikely to be useful, the second of which is a null operator on the second activation, and the third of which can be mathematically rewritten as a single matrix multiplication.
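A filter of this kind can be sketched as a recursive check over the tree. The sketch assumes a nested-tuple encoding and interprets "stacking" as an operator whose direct child uses the same operator, which is one plausible reading of the restriction rather than a confirmed detail.

```python
def has_stacked_duplicate(node):
    """True if any operator is applied directly to the output of the same operator."""
    if isinstance(node, str):
        return False  # source nodes cannot be stacked
    op, *children = node
    for child in children:
        if not isinstance(child, str) and child[0] == op:
            return True
        if has_stacked_duplicate(child):
            return True
    return False


candidates = [
    ("Sigmoid", ("Sigmoid", ("MM", "x_t"))),           # two sigmoids in succession
    ("MM", ("MM", "x_t")),                             # collapses to a single matrix multiplication
    ("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_tm1"))),  # passes the filter
]
kept = [tree for tree in candidates if not has_stacked_duplicate(tree)]
```

The matrix multiplication case is redundant because $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is again a single linear layer with bias.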
If a given candidate architecture definition contained $c_{t-1}$, the architecture was queried for valid subgraphs from which $c_t$ could be generated. The subgraphs must contain $c_{t-1}$ such that $c_t$ is recurrent and must contain three or more nodes to prevent trivial recurrent connections. A new candidate architecture is then generated for each valid subgraph.
Random architecture search directed by a learned ranking function
Up to 50,000 candidate architecture DSL definitions are produced by a random architecture generator at the beginning of each search step. This full set of candidate architectures is then simulated by the ranking network and an estimated perplexity is assigned to each. Given the relative simplicity and small training dataset, the ranking function was retrained on the previous full training results before being used to estimate the next batch of candidate architectures. Up to architectures were then selected for full training. of these were selected from the candidate architectures with the best perplexity while the last were selected via weighted sampling without replacement, prioritizing architectures with better estimated perplexities.
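The selection step can be sketched as follows. The function name, the parameters `n_best` and `n_sampled`, and the choice of inverse estimated perplexity as the sampling weight are assumptions for illustration; the text only states that better estimated perplexities are prioritized.

```python
import random


def select_for_training(estimates, n_best, n_sampled, rng):
    """Pick the n_best lowest-perplexity candidates, then sample n_sampled more."""
    ranked = sorted(estimates, key=estimates.get)  # lower perplexity is better
    chosen = ranked[:n_best]
    pool = ranked[n_best:]
    for _ in range(min(n_sampled, len(pool))):
        # Assumed weighting: inverse perplexity, so better candidates are likelier.
        weights = [1.0 / estimates[name] for name in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)  # sampling without replacement
        chosen.append(pick)
    return chosen


estimated_perplexity = {"a": 80.0, "b": 95.0, "c": 120.0, "d": 300.0, "e": 150.0}
selected = select_for_training(estimated_perplexity, n_best=2, n_sampled=2,
                               rng=random.Random(0))
```

Sampling part of the batch rather than taking only the top candidates keeps some diversity in the architectures that are fully trained, which in turn gives the ranking function training data away from its current optimum.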
$c_t$ architectures were introduced part way through the architecture search after 750 valid $h_t$ architectures had been evaluated, with $h_t$ architectures being used to bootstrap the $c_t$ architecture vector representations. Figure 3 provides a visualization of the architecture search over time, showing valid $h_t$ and $c_t$ architectures.
Analyzing the BC3 cell
After evaluating the top 10 cells using a larger model on WikiText-2, the top performing cell BC3 (named after the identifying hash, bc3dc7a) was an unexpected layering of two Gate3 operators,
where is an element-wise multiplication and all weight matrices .
Equations 1 to 3 produce the first Gate3 while equations 4 and 5 produce the second Gate3. The output of the first Gate3 becomes the value for $c_t$ after passing through a $\tanh$ activation.
While only the core DSL was used, BC3 still breaks with many human intuitions regarding RNN architectures. While the formulation of the gates $f$ and $o$ is standard in many RNN architectures, the rest of the architecture is less intuitive. The Gate3 that produces $c_t$ (equation 3) is mixing between a matrix multiplication of the current input $x_t$ and a complex interaction between $c_{t-1}$ and $x_t$ (equation 2). In BC3, $c_{t-1}$ passes through multiple matrix multiplications, a gate, and a $\tanh$ activation before becoming $c_t$. This is non-conventional as most RNN architectures allow $c_{t-1}$ to become $c_t$ directly, usually through a gating operation. The architecture also does not feature a masking output gate like the LSTM, with outputs more similar to that of the GRU that does poorly on language modeling. That this architecture would be able to learn without severe instability or succumbing to exploding gradients is not intuitively obvious.
4.1.1 Evaluating the BC3 cell
For the final results on BC3, we use the experimental setup from Merity et al. (2017a) and report results for the Penn Treebank (Table 1) and WikiText-2 (Table 2) datasets. To show that not any standard RNN can achieve similar perplexity on the given setup, we also implemented and tuned a GRU based model which we found to strongly underperform compared to the LSTM, BC3, NASCell, or Recurrent Highway Network (RHN). Full hyperparameters for the GRU and BC3 are in Appendix B4. Our model uses equal or fewer parameters compared to the models it is compared against. While BC3 did not outperform the highly tuned AWD-LSTM (Merity et al., 2017a) or skip connection LSTM (Melis et al., 2017), it did outperform the Recurrent Highway Network (Zilly et al., 2016) and NASCell (Zoph and Le, 2017) on the Penn Treebank, where NASCell is an RNN found using reinforcement learning architecture search specifically optimized over the Penn Treebank.
Table 1: Penn Treebank language modeling.
|Model||Parameters|
|Inan et al. (2016) - Variational LSTM (tied) + augmented loss||24M|
|Inan et al. (2016) - Variational LSTM (tied) + augmented loss||51M|
|Zilly et al. (2016) - Variational RHN (tied)||23M|
|Zoph and Le (2017) - Variational NAS Cell (tied)||25M|
|Zoph and Le (2017) - Variational NAS Cell (tied)||54M|
|Melis et al. (2017) - 4-layer skip connection LSTM (tied)||24M|
|Merity et al. (2017a) - 3-layer weight drop LSTM + NT-ASGD (tied)||24M|
|3-layer weight drop GRU + NT-ASGD (tied)||24M|
|3-layer weight drop BC3 + NT-ASGD (tied)||24M|
Table 2: WikiText-2 language modeling.
|Model||Parameters|
|Inan et al. (2016) - Variational LSTM (tied)||28M|
|Inan et al. (2016) - Variational LSTM (tied) + augmented loss||28M|
|Melis et al. (2017) - 1-layer LSTM (tied)||24M|
|Melis et al. (2017) - 2-layer skip connection LSTM (tied)||24M|
|Merity et al. (2017a) - 3-layer weight drop LSTM + NT-ASGD (tied)||33M|
|3-layer weight drop GRU + NT-ASGD (tied)||33M|
|3-layer weight drop BC3 + NT-ASGD (tied)||33M|
4.2 Incremental Architecture Construction using RL for Machine Translation
For our experiments involving the extended DSL and our RL based generator, we use machine translation as our domain. The candidate architectures produced by the RL agent were directly used without the assistance of a ranking function. This leads to a different kind of generator: whereas the ranking function learns global knowledge about the whole architecture, the RL agent is geared towards local knowledge about which operator is ideal to place next.
Before evaluating the constructed architectures, we pre-train our generator to internalize intuitive priors. These priors include enforcing well formed RNNs (i.e. ensuring $x_t$, $h_{t-1}$, and one or more matrix multiplications and activations are used) and moderate depth restrictions (between 3 and 11 nodes deep). The full list of priors and model details are in Appendix C1.
For the model evaluation, we ran up to 28 architectures in parallel, optimizing one batch after receiving results from at least four architectures. As failing architectures (such as those with exploding gradients) return early, we needed to ensure the batch contained a mix of both positive and negative results. To ensure the generator yielded mostly functioning architectures whilst understanding the negative impact of invalid architectures, we chose to require at least three good architectures with a maximum of one failing architecture per batch.
For candidate architectures with multiple placement options for the memory gate $c_t$, we evaluated all possible locations and waited until we had received the results for all variations. The best architecture result was then used as the reward for the architecture.
Baseline Machine Translation Experiment Details
To ensure our baseline experiment was fast enough to evaluate many candidate architectures, we used the Multi30k English to German (Elliott et al., 2016) machine translation dataset. The training set consists of 30,000 sentence pairs that briefly describe Flickr captions. Our experiments are based on the OpenNMT codebase with an attentional unidirectional encoder-decoder LSTM architecture, where we specifically replace the LSTM encoder with architectures designed using the extended DSL.
For the hyper-parameters in our baseline experiment, we use a hidden and word encoding size of 300, 2 layers for the encoder and decoder RNNs, batch size of 64, back-propagation through time of 35 timesteps, dropout of , input feeding, and stochastic gradient descent. The learning rate starts at 1 and decays by 50% when validation perplexity fails to improve. Training stops when the learning rate drops below.
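The learning rate schedule described above can be sketched as a small function. The stopping threshold `min_lr` is a hypothetical parameter because its value is not available here, and "fails to improve" is interpreted as not beating the best validation perplexity seen so far, which is an assumption.

```python
def lr_schedule(val_perplexities, start_lr=1.0, decay=0.5, min_lr=0.1):
    """Return the learning rate in effect after each epoch's validation result."""
    lr = start_lr
    best = float("inf")
    history = []
    for perplexity in val_perplexities:
        if perplexity < best:
            best = perplexity   # improvement: keep the current learning rate
        else:
            lr *= decay         # no improvement: halve the learning rate
        history.append(lr)
        if lr < min_lr:         # hypothetical stopping threshold
            break
    return history


rates = lr_schedule([50.0, 40.0, 41.0, 39.0, 39.5, 40.0], min_lr=0.2)
```

With these example perplexities the rate is halved three times and training stops after the sixth epoch.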
Analysis of the Machine Translation Architecture Search
Figure 4 shows the relative frequency of each operator in the architectures that were used to optimize the generator each batch.
For all the architectures in a batch, we sum up the absolute number that each operator occurs and divide by the total number of operators in all architectures of the batch. By doing this for all batches (x-axis), we can see which operators the generator prefers over time.
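This computation can be sketched as below, assuming architectures are encoded as nested tuples and that only operators (not source nodes) are counted; both are assumptions of the sketch.

```python
from collections import Counter


def count_operators(node, counts):
    """Add every operator occurrence in one architecture to counts."""
    if isinstance(node, str):
        return  # source nodes are not counted as operators here
    counts[node[0]] += 1
    for child in node[1:]:
        count_operators(child, counts)


def operator_frequencies(batch):
    """Relative frequency of each operator over all architectures in a batch."""
    counts = Counter()
    for architecture in batch:
        count_operators(architecture, counts)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


batch = [
    ("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_tm1"))),
    ("Sigmoid", ("MM", "x_t")),
]
frequencies = operator_frequencies(batch)
```

Computing `operator_frequencies` once per batch and plotting the results in batch order gives one curve per operator over the course of the search.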
Intriguingly, the generator seems to rely almost exclusively on the core DSL () when generating early batches. The low usage of the extended DSL operators may also be due to these operators frequently resulting in unstable architectures, thus being ignored in early batches. Part way through training however the generator begins successfully using a wide variety of the extended DSL (). We hypothesize that the generator first learns to build robust architectures and is only then capable of inserting more varied operators without compromising the RNN’s overall stability. Since the reward function it is fitting is complex and unknown to the generator, it requires substantial training time before the generator can understand how robust architectures are structured. However, the generator seems to view the extended DSL as beneficial given it continues using these operators.
Overall, the generator found 806 architectures that out-performed the LSTM based on raw test BLEU score, out of a total of 3450 evaluated architectures (23%). The best architecture (determined by the validation BLEU score) achieved a test BLEU score of respectively, compared to the standard LSTM’s . Multiple cells also rediscovered a variant of residual networks () (He et al., 2016) or highway networks () (Srivastava et al., 2015). Every operation in the core and extended DSL made its way into an architecture that outperformed the LSTM, and many of the architectures found by the generator would likely not be considered valid by the standards of current architectures. These results suggest that the space of successful RNN architectures might hold many unexplored combinations, with human bias possibly preventing their discovery.
In Table 3 we take the top five architectures found during automated architecture search on the Multi30k dataset and test them over the IWSLT 2016 (English to German) dataset (Cettolo et al., 2016). The training set consists of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics with more conversational language than in the Multi30k dataset. This dataset is larger, both in number of sentences and vocabulary, and was not seen during the architecture search. While all six architectures achieved higher validation and test BLEU on Multi30k than the LSTM baseline, it appears the architectures did not transfer cleanly to the larger IWSLT dataset. This suggests that architecture search should be either run on larger datasets to begin with (a computationally expensive proposition) or evaluated over multiple datasets if the aim is to produce general architectures. We also found that the correlation between loss and BLEU is far from optimal: architectures performing exceptionally well on the loss sometimes scored poorly on BLEU. It is also unclear how these metrics generalize to perceived human quality of the model (Tan et al., 2015) and thus using a qualitatively and quantitatively more accurate metric is likely to benefit the generator. For hyperparameters of the IWSLT model, refer to Appendix C3.
|DSL Architecture Description||Multi30k||IWSLT’16|
|(encoder used in attentional encoder-decoder LSTM)||Val Loss||Test BLEU||Test BLEU|
|5-deep nested with LayerNorm (see Appendix C2)|
|Residual with positional encoding (see Appendix C2)|
5 Related Work
Architecture engineering has a long history, with many traditional explorations involving a large amount of computing resources and an extensive exploration of hyperparameters (Jozefowicz et al., 2015; Greff et al., 2016; Britz et al., 2017). The approach most similar to our work is Zoph and Le (2017) which introduces a policy gradient approach to search for convolutional and recurrent neural architectures. Their approach to generating recurrent neural networks was slot filling, where element-wise operations were selected for the nodes of a binary tree of specific size. The node to produce $c_t$ was selected once all slots had been filled. This slot filling approach is not highly flexible in regards to the architectures it allows. As opposed to our DSL, it is not possible to have matrix multiplications on internal nodes, inputs can only be used at the bottom of the tree, and there is no complex representation of the hidden states $h_t$ or $c_t$ as our unrolling ranking function provides. Many other similar techniques utilizing reinforcement learning approaches have emerged such as designing CNN architectures with Q-learning (Baker et al., 2016).
Neuroevolution techniques such as NeuroEvolution of Augmenting Topologies (NEAT) (Stanley and Miikkulainen, 2002) and HyperNEAT (Stanley et al., 2009) evolve the weight parameters and structures of neural networks. These techniques have been extended to producing the non-shared weights for an LSTM from a small neural network (Ha et al., 2016) and evolving the structure of a network (Fernando et al., 2016; Bayer et al., 2009).
6 Conclusion
We introduced a flexible domain specific language for defining recurrent neural network architectures that can represent most human designed architectures. It is this flexibility that allowed our generators to come up with novel combinations in two tasks. These architectures used both core operators that are already used in current architectures as well as operators that are largely unstudied such as division or sine curves. The resulting architectures do not follow human intuition yet perform well on their targeted tasks, suggesting the space of usable RNN architectures is far larger than previously assumed. We also introduce a component-based concept for architecture search from which we instantiated two approaches: a ranking function driven search which allows for richer representations of complex RNN architectures that involve long term memory ($c_t$) nodes, and a Reinforcement Learning agent that internalizes knowledge about the search space to propose increasingly better architectures. As computing resources continue to grow, we see automated architecture generation as a promising avenue for future research.
References
- Ba et al. (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer Normalization. arXiv preprint, 2016.
- Baker et al. (2016) B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing Neural Network Architectures using Reinforcement Learning. arXiv preprint, 2016.
- Bayer et al. (2009) J. Bayer, D. Wierstra, J. Togelius, and J. Schmidhuber. Evolving memory cell structures for sequence learning. Artificial Neural Networks – ICANN, 2009.
- Bergstra et al. (2011) J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011.
- Bradbury et al. (2017) J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-Recurrent Neural Networks. International Conference on Learning Representations (ICLR), 2017.
- Britz et al. (2017) D. Britz, A. Goldie, M.-T. Luong, and Q. V. Le. Massive exploration of neural machine translation architectures. CoRR, 2017.
- Cettolo et al. (2016) M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. The IWSLT 2016 Evaluation Campaign. Proceedings of the 13th Workshop on Spoken Language Translation, 2016.
- Cho et al. (2014) K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. CoRR, 2014.
- Elliott et al. (2016) D. Elliott, S. Frank, K. Sima’an, and L. Specia. Multi30k: Multilingual English-German Image Descriptions. arXiv preprint, 2016.
- Fernando et al. (2016) C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra. Convolution by evolution: Differentiable pattern producing networks. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, pages 109–116. ACM, 2016.
- Gal and Ghahramani (2016) Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
- Greff et al. (2016) K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems, 2016.
- Ha et al. (2016) D. Ha, A. Dai, and Q. V. Le. HyperNetworks. arXiv preprint, 2016.
- Harutyunyan et al. (2016) A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. Q($\lambda$) with Off-Policy Corrections. In International Conference on Algorithmic Learning Theory, pages 305–320. Springer, 2016.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, Nov 1997.
- Inan et al. (2016) H. Inan, K. Khosravi, and R. Socher. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. arXiv preprint, 2016.
- Jozefowicz et al. (2015) R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
- Klambauer et al. (2017) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-Normalizing Neural Networks. arXiv preprint, 2017.
- Melis et al. (2017) G. Melis, C. Dyer, and P. Blunsom. On the State of the Art of Evaluation in Neural Language Models. arXiv preprint, 2017.
- Merity et al. (2017a) S. Merity, N. S. Keskar, and R. Socher. Regularizing and Optimizing LSTM Language Models. arXiv preprint, 2017a.
- Merity et al. (2017b) S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer Sentinel Mixture Models. In ICLR, 2017b.
- Munos et al. (2016) R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1046–1054, 2016.
- Press and Wolf (2016) O. Press and L. Wolf. Using the output embedding to improve language models. arXiv preprint, 2016.
- Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS, 2012.
- Srivastava et al. (2015) R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway Networks. arXiv preprint, 2015.
- Stanley and Miikkulainen (2002) K. O. Stanley and R. Miikkulainen. Evolving Neural Networks through Augmenting Topologies. Evolutionary computation, 10(2):99–127, 2002.
- Stanley et al. (2009) K. O. Stanley, D. B. D’Ambrosio, and J. Gauci. A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks. Artificial life, 15 2:185–212, 2009.
- Szegedy et al. (2015) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Tai et al. (2015) K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015.
- Tan et al. (2015) L. Tan, J. Dehdari, and J. van Genabith. An awkward disparity between BLEU/RIBES scores and human judgements in machine translation. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 74–81, 2015.
- Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention Is All You Need. arXiv preprint, 2017.
- Williams (1992) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Zaremba et al. (2014) W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint, 2014.
- Zhou et al. (2016) G.-B. Zhou, J. Wu, C.-L. Zhang, and Z.-H. Zhou. Minimal gated unit for recurrent neural networks. CoRR, abs/1603.09420, 2016.
- Zilly et al. (2016) J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent Highway Networks. arXiv preprint, 2016.
- Zoph and Le (2017) B. Zoph and Q. V. Le. Neural Architecture Search with Reinforcement Learning. International Conference on Learning Representations (ICLR), 2017.
Appendix A: Domain Specific Language
DSL GRU Definition
DSL BC3 Definition
1. Architecture optimizations during DSL compilation
To improve the running speed of the RNN cell architectures, we can collect all matrix multiplications performed on a single source node ($x_t$, $x_{t-1}$, $h_{t-1}$, or $c_{t-1}$) and batch them into a single matrix multiplication. As an example, this optimization would simplify the LSTM’s 8 small matrix multiplications (4 for $x_t$, 4 for $h_{t-1}$) into 2 large matrix multiplications. This allows for higher GPU utilization and lower CUDA kernel launch overhead.
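The batching idea can be illustrated with a minimal sketch. It uses plain Python lists to stay dependency-free, whereas a real compiler would emit a single call to a tensor library. The helper names (`matvec`, `batched_matvec`) and the toy sizes are assumptions made for illustration and are not taken from the paper's implementation.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def batched_matvec(weights, x):
    """Apply several matrices that share one input with a single multiplication.

    The matrices are stacked row-wise into one tall matrix, multiplied once,
    and the result is split back into one output per original matrix.
    """
    stacked = [row for W in weights for row in W]
    out = matvec(stacked, x)
    results, start = [], 0
    for W in weights:
        results.append(out[start:start + len(W)])
        start += len(W)
    return results

random.seed(0)
hidden, n_in = 3, 4  # toy sizes, chosen only for the example
x_t = [random.uniform(-1, 1) for _ in range(n_in)]
# Four hypothetical gate matrices that all read from the same source node.
gate_weights = [
    [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    for _ in range(4)
]

separate = [matvec(W, x_t) for W in gate_weights]   # four small multiplications
batched = batched_matvec(gate_weights, x_t)         # one large multiplication
```

Both paths compute the same four outputs. The batched path simply trades four small kernel launches for one larger one.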
2. Preventing duplicates through canonical architecture ordering
There exist many possible DSL specifications that result in an equivalent RNN cell. When two matrix multiplications are applied to the same source node, such as Add(MM($x_t$), MM($x_t$)), a matrix multiplication reaching an equivalent result can be achieved by constructing a specific matrix and calculating MM($x_t$). Additional equivalences can be found when an operator is commutative, such as Add($x_t$, $h_{t-1}$) being equivalent to the reordered Add($h_{t-1}$, $x_t$). We can define a canonical ordering of an architecture by sorting the arguments of any commutative nodes. In our work, nodes are sorted according to their DSL represented as a string, though any consistent ordering is allowable. For our core DSL, the only non-commutative operation is Gate3, where the first two arguments can be sorted, but the input to the gate must remain in its original position. For our extended DSL, the Sub and Div operators are order sensitive and disallow any reordering.
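A minimal sketch of this canonical ordering follows. It represents a DSL tree as nested tuples whose first element is the operator name and uses plain strings for source nodes. The tuple encoding, the operator names (`Add`, `Mult`, `Gate3`, `Sub`, `MM`, `Tanh`), and the node names `x_t`, `h_tm1`, and `c_tm1` are assumptions made for illustration, not the paper's implementation.

```python
COMMUTATIVE = {"Add", "Mult"}  # every argument may be reordered

def to_string(node):
    """Render a DSL tree as a string, e.g. Add(MM(x_t), h_tm1)."""
    if isinstance(node, str):
        return node
    op, *args = node
    return f"{op}({', '.join(to_string(a) for a in args)})"

def canonicalize(node):
    """Return an equivalent tree whose reorderable arguments are sorted."""
    if isinstance(node, str):
        return node
    op, *args = node
    args = [canonicalize(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=to_string)
    elif op == "Gate3":
        # Sort the two gated inputs; the gate input keeps its final position.
        args = sorted(args[:2], key=to_string) + args[2:]
    # Any other operator (e.g. Sub, Div, unary operators) keeps its order.
    return (op, *args)

a = ("Add", ("MM", "x_t"), ("Tanh", "h_tm1"))
b = ("Add", ("Tanh", "h_tm1"), ("MM", "x_t"))
same_cell = canonicalize(a) == canonicalize(b)  # duplicates collapse to one form
```

Storing `to_string(canonicalize(tree))` in a set is one simple way to skip candidates that have already been generated.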
Appendix B: Language Modeling
1. Baseline experimental setup and hyperparameters
The models are trained using stochastic gradient descent (SGD) with an initial learning rate of . Training continues for epochs with the learning rate being divided by if the validation perplexity has not improved since the last epoch. Dropout is applied to the word embeddings and outputs of layers as in Zaremba et al. [2014] at a rate of . Weights for the word vectors and the softmax were also tied [Inan et al., 2016, Press and Wolf, 2016]. Aggressive gradient clipping () was performed to ensure that architectures such as the ReLU RNN would be able to train without exploding gradients. The embeddings were initialized randomly between .
During training, any candidate architectures that experienced exploding gradients or had perplexity over 500 after five epochs were regarded as failed architectures. Failed architectures were immediately terminated. While not desirable, failed architectures still serve as useful training examples for the ranking function.
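The schedule and failure rules above can be sketched as a small bookkeeping function. Only the perplexity threshold of 500 and the five-epoch check come from the text. The function name `run_candidate`, the use of NaN or infinity to model exploding gradients, and the learning rate and divisor in the usage lines are hypothetical placeholders.

```python
import math

def run_candidate(perplexities, lr, lr_divisor, max_ppl=500.0, check_epoch=5):
    """Replay per-epoch validation perplexities and apply the rules above.

    Returns (status, final_lr, epochs_run). `perplexities` stands in for real
    training; a NaN or infinite value models an exploding gradient.
    """
    previous = None
    for epoch, ppl in enumerate(perplexities, start=1):
        if math.isnan(ppl) or math.isinf(ppl):
            return "failed", lr, epoch      # exploding gradients: terminate
        if epoch == check_epoch and ppl > max_ppl:
            return "failed", lr, epoch      # perplexity still too high: terminate
        if previous is not None and ppl >= previous:
            lr /= lr_divisor                # no improvement since last epoch
        previous = ppl
    return "completed", lr, len(perplexities)

# Hypothetical learning rate and divisor, chosen only for the example.
slow = run_candidate([900.0, 700.0, 650.0, 600.0, 550.0], lr=1.0, lr_divisor=2.0)
fine = run_candidate([300.0, 200.0, 210.0, 190.0], lr=1.0, lr_divisor=2.0)
```

A failed run still yields an architecture-performance pair, so its status can be logged as a training example for the ranking function rather than discarded.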
2. Ranking function hyperparameters
For the ranking function, we use a hidden size of for the TreeLSTM nodes and a batch size of . We use regularization of and dropout on the final dense layer output of . As we are more interested in reducing the perplexity error for better architectures, we sample architectures more frequently if their perplexity is lower.
For unrolling of the architectures, a proper unroll would replace with and with . We found the ranking network performed better without these substitutions however and thus only substituted to and to .
3. How baseline experimental settings may impair architecture search
The baseline experiments used during the architecture search strongly influence which models are eventually generated. As an example, BC3 may not have been discovered if we had used all the standard regularization techniques in the baseline language modeling experiment. Analyzing how variational dropout [Gal and Ghahramani, 2016] would work when applied to BC3 illustrates the importance of hyperparameter selection for the baseline experiment.
On LSTM cells, variational dropout [Gal and Ghahramani, 2016] is only performed upon , not , as otherwise the long term hidden state would be destroyed. For BC3, equation 6 shows that the final gating operation mixes and . If variational dropout is applied to in this equation, BC3’s hidden state will have permanently lost information. Applying variational dropout only to the values in the two gates and ensures no information is lost. This observation provides good justification for not performing variational dropout in the baseline experiment given that this architecture (and any architecture which uses in a direct manner like this) would be disadvantaged otherwise.
4. Hyperparameters for PTB and WikiText-2 BC3 experiments
For the Penn Treebank BC3 language modeling results, the majority of hyperparameters were kept equal to those of the baseline AWD-LSTM. The model was trained for 200 epochs using NT-ASGD with a learning rate of 15, a batch size of and BPTT of . The variational dropout for the input, RNN hidden layers, and output was set to , , and respectively. Embedding dropout of was used. The word vectors had a dimensionality of and the hidden layers had a dimensionality of . The BC3 used 3 layers with weight drop of applied to the recurrent weight matrices. Activation regularization and temporal activation regularization of and were used. Gradient clipping was set to . Finetuning was run for an additional 13 epochs. For the WikiText-2 BC3 language modeling results, the parameters were kept equal to those of the Penn Treebank experiment. The model was run for a total of 100 epochs with 7 epochs of finetuning.
For the Penn Treebank GRU language modeling results, the hyperparameters were equal to those of the BC3 PTB experiment but with a hidden size of , weight drop of , learning rate of , gradient clipping of , and temporal activation regularization of . The model was run for 280 epochs with 6 epochs of finetuning. For the WikiText-2 GRU language modeling results, the hyperparameters were kept equal to those of the Penn Treebank experiment. The model was run for 125 epochs with 50 epochs of finetuning where the weight drop was reduced to .
Appendix C: Machine Translation
1. Baseline experimental setup and hyperparameters for RL
To represent an architecture with the encoder, we traverse through the architecture recursively, starting from the root node. For each node, the operation is tokenized and embedded into a vector. An LSTM is then applied to this vector as well as the result vectors of all of the current node’s children. Note that we use the same LSTM for every node but reset its hidden states between nodes so that it always starts from scratch for every child node.
Based on the encoder’s vector representation of an architecture, the action scores are determined as follows:
We then choose the specific action with a multinomial applied to the action scores. We encourage exploration by randomly choosing the next action according to an epsilon-greedy strategy with .
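The selection step can be sketched as follows. The sketch assumes that the action scores are turned into probabilities with a softmax before multinomial sampling, which the text does not state. The function names and the epsilon values in the usage lines are hypothetical.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def choose_action(scores, epsilon, rng):
    """Pick an action index.

    With probability epsilon the action is drawn uniformly at random
    (exploration); otherwise it is sampled from the multinomial distribution
    defined by the softmax of the scores.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

rng = random.Random(0)
# With no exploration, a dominant score is chosen essentially every time.
exploit = [choose_action([0.0, 0.0, 50.0], epsilon=0.0, rng=rng) for _ in range(100)]
# With full exploration, every action is visited regardless of its score.
explore = [choose_action([0.0, 0.0, 50.0], epsilon=1.0, rng=rng) for _ in range(200)]
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when comparing generator runs.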
The reward that expresses how well an architecture performed is computed based on the validation loss. We re-scale it according to a soft exponential so that the last few increases (distinguishing a good architecture from a great one) are rewarded more. The specific reward function we use is which follows earlier efforts to keep the reward between zero and 140.
For pre-training the generator, we use the following priors:
maintain a depth between 3 and 11
use the current input $x_t$, the hidden state $h_{t-1}$, at least one matrix multiplication (MM), and at least one activation function
do not use the same operation for a node and its child
do not use an activation as a child of another activation
do not use the same inputs to a gate
apply the matrix multiplication operator (MM) only to a source node
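The priors above can be expressed as a validity check over a candidate tree. This is a minimal sketch under stated assumptions: trees are nested tuples whose first element is the operator name, source nodes are the strings `x_t`, `x_tm1`, `h_tm1`, and `c_tm1`, depth counts the nodes on the longest root-to-leaf path, and the gate rule is read as "no two inputs of a `Gate3` node are identical". None of these encoding choices comes from the paper.

```python
SOURCES = {"x_t", "x_tm1", "h_tm1", "c_tm1"}   # assumed source node names
ACTIVATIONS = {"Sigmoid", "Tanh", "ReLU"}      # assumed activation names

def depth(node):
    """Number of nodes on the longest root-to-leaf path."""
    if isinstance(node, str):
        return 1
    return 1 + max(depth(c) for c in node[1:])

def collect(node, ops, leaves):
    """Gather every operator name and source node used in the tree."""
    if isinstance(node, str):
        leaves.add(node)
        return
    ops.add(node[0])
    for child in node[1:]:
        collect(child, ops, leaves)

def local_rules_ok(node):
    """Check the per-node priors recursively."""
    if isinstance(node, str):
        return True
    op, *children = node
    for child in children:
        if not isinstance(child, str):
            if child[0] == op:
                return False   # same operation as a child
            if op in ACTIVATIONS and child[0] in ACTIVATIONS:
                return False   # activation applied to an activation
    if op == "MM" and children[0] not in SOURCES:
        return False           # MM must be applied to a source node
    if op == "Gate3" and len(set(children)) < len(children):
        return False           # gate inputs must differ
    return all(local_rules_ok(c) for c in children)

def satisfies_priors(tree, min_depth=3, max_depth=11):
    ops, leaves = set(), set()
    collect(tree, ops, leaves)
    return (min_depth <= depth(tree) <= max_depth
            and {"x_t", "h_tm1"} <= leaves
            and "MM" in ops
            and bool(ops & ACTIVATIONS)
            and local_rules_ok(tree))

good = ("Tanh", ("Add", ("MM", "x_t"), ("MM", "h_tm1")))
stacked_activations = ("Tanh", ("Sigmoid", ("Add", ("MM", "x_t"), ("MM", "h_tm1"))))
```

A check like this could filter randomly generated trees to build a pre-training set, or serve as a shaping signal for the generator.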
2. Full DSL for deeply nested and residual with positional encoding
These DSL definitions are too long to fit within the results table. They nonetheless indicate the flexibility of the DSL, which can express architectures ranging from minimal to quite complex.
5-deep nested with LayerNorm
Residual with positional encoding
3. Hyperparameters for the IWSLT’16 models
For the models trained on the IWSLT’16 English to German dataset, the hyperparameters were kept largely equivalent to those of the default OpenNMT LSTM baseline.
All models were unidirectional and the dimensionality of both the word vectors and hidden states was , which was required because many of the generated architectures were residual in nature.
The models were 2 layers deep, utilized a batch size of , and standard dropout of between layers.
The learning rate began at and decayed by whenever validation perplexity failed to improve.
When the learning rate fell below training was finished.
Models were evaluated using a batch size of to ensure RNN padding did not impact the results.
4. Finding an architecture for the decoder and encoder/decoder
We also briefly explored automated architecture generation for the decoder alone, as well as for the encoder and decoder jointly. Both yielded good results with interesting architectures, but their performance fell short of the more promising approach of focusing on the encoder.
With additional evaluated models, we believe both of these would yield comparable or greater results.
5. Variation in activation patterns by generated architectures
Given the differences in generated architectures, and the usage of components likely to impact the long term hidden state of the RNN models, we began to explore the progression of the hidden state over time. Each of the activations differs substantially from those of the other architectures even though they are parsing the same input.
As the input features are likely to not only be captured in different ways but also stored and processed differently, this suggests that ensembles of these highly heterogeneous architectures may be effective.