Neural Abstract Reasoner

11/12/2020 · by Victor Kolev, et al.

Abstract reasoning and logic inference are difficult problems for neural networks, yet essential to their applicability in highly structured domains. In this work we demonstrate that a well-known technique such as spectral regularization can significantly boost the capabilities of a neural learner. We introduce the Neural Abstract Reasoner (NAR), a memory-augmented architecture capable of learning and using abstract rules. We show that, when trained with spectral regularization, NAR achieves 78.8% accuracy on the Abstraction and Reasoning Corpus, a fourfold improvement over the best known human hand-crafted symbolic solvers. We provide some intuition for the effects of spectral regularization in the domain of abstract reasoning based on theoretical generalization bounds and Solomonoff's theory of inductive inference.


1 Introduction

Extracting and reasoning with abstract concepts is a crucial ability for any learner that is to operate in combinatorially complex open worlds or domains with limited or structured data. It is well known that neural learners struggle to operate in such conditions due to their poor generalization capabilities in structured domains Chollet (2019); Marcus (2018). In this work, we demonstrate that spectral regularization provides neural networks with a strong inductive bias towards learning and utilizing abstract concepts akin to a symbolic learner.

For that purpose, we employ the Abstraction and Reasoning Corpus (ARC) Chollet (2019), which contains tasks related to manipulating colored patterns in a grid. In order to successfully solve the tasks in the corpus, an agent needs to be able to count, manipulate numbers, and work with topological and geometric concepts, as well as recognize the notion of objects. There are 400 training tasks and 400 (distinct) evaluation tasks. Each task has a small set of input-output example pairs (between 1 and 5) and a query input pattern. This is quite a challenging dataset due to the small amount of example data, the large number of different tasks and their abstract nature.

So far, the best known solution, with a success rate of approximately 20%, is the winner of the ARC Kaggle challenge: a carefully hand-crafted symbolic solver written in roughly 7k lines of C++ code. In this paper, we introduce the Neural Abstract Reasoner (NAR), which achieves an accuracy of approximately 79%, thereby outperforming even the best symbolic solver created by a human. The NAR architecture contains a Differentiable Neural Computer (DNC) that learns general problem-solving skills and a Transformer network responsible for solving particular task instances (see Fig. 2 in Chollet (2019)). Importantly, spectral regularization plays a fundamental role in the successful training of NAR. From a purely machine learning perspective, spectral regularization is known to reduce the effective number of parameters in the network; however, we provide additional theoretical intuition and demonstrate that spectral regularization also pushes the network towards finding algorithmically simpler solutions, as recommended by Solomonoff's theory of inductive inference Li and Vitanyi (2008).

2 Related Work

Neuro-symbolic architectures

Hybrid neuro-symbolic approaches enable agents to solve structured tasks from raw data, while learning faster and being more robust to noise Gaunt et al. (2017); Verma et al. (2018); Penkov and Ramamoorthy (2018); Mao et al. (2018). However, the majority of methods proposed so far are designed with specific domains in mind, making them inapplicable to a broader range of tasks. A notable exception is the architecture proposed by Ellis et al. (2020), which is capable of learning rules from geometry, vector algebra, and physics, and of solving tasks such as drawing pictures or building complete scenes. Importantly, these methods often require large amounts of data, which is in stark contrast with human capabilities.

The ARC dataset Chollet (2019) is specially designed to push research towards data-efficient learners, as there are hundreds of tasks, each of which is represented by no more than 5 input/output examples. To the best of our knowledge, the Neural Abstract Reasoner presented in this paper is the first architecture that achieves a performance rate of approximately 79% on the ARC dataset, outperforming state-of-the-art hand-coded symbolic systems by a factor of 4. The NAR architecture is a composition of a slowly learning Differentiable Neural Computer and a fast-adapting Transformer network, creating an outer learning loop and an inner executing loop, as suggested in Chollet (2019).

Complexity and generalization

The analysis of complexity and generalization metrics applied to neural networks has formed a central line of theoretical ML research, with a variety of recent breakthroughs in terms of PAC-based and compression methods (cf. Arora et al. (2018); Bartlett et al. (2017); Jiang* et al. (2020); Suzuki et al. (2020); Wei and Ma (2019) and the references therein). In particular, many generalization approaches based on spectral norm analysis have so far been proposed and investigated Neyshabur et al. (2017); Sanyal et al. (2020). However, to our knowledge the present work is the first to address the relationship between the behaviour of spectral norms and abstract reasoning tasks, whereby we demonstrate a strong relationship between classical spectral regularization and the ability of a neural model to learn abstract concepts and rules. As touched upon in Section 4, one could draw motivation from well-known algorithmic information theory concepts such as Solomonoff inference and program generation based on least complexity Li and Vitanyi (2008); Blier and Ollivier (2018); Schmidhuber (1997).

3 Methods

Description

The ARC dataset consists of a training portion $\mathcal{D}_{train}$ and an evaluation portion $\mathcal{D}_{eval}$. Each portion consists of 400 tasks (the training tasks are augmented to 15000 through color permutations and rotations). The individual tasks are grouped in tags based on the skills needed to solve them Bonin. A task consists of up to five example input-output pairs $\{(x_i, y_i)\}$ and one query pair $(x_q, y_q)$. A neural learner has to infer the underlying logical rule mapping inputs to outputs and apply it to the query. All inputs and outputs are grids of variable sizes with 10 colors. A solution is correct only when all the pixels on the grid match.

First, we derive a latent representation of the grids with an InceptionNet-style Szegedy et al. (2015) deterministic auto-encoder. Let $x$ be a grid, and let $E_\phi$ and $D_\phi$ be the encoder and decoder, parametrized by $\phi$. We train the embedding network by minimizing the standard autoencoder cross-entropy loss between $x$ and its reconstruction $D_\phi(E_\phi(x))$.

Next, we consider all latent grid embeddings and train a Differentiable Neural Computer Graves et al. (2016) with parameters $\theta$ to infer an instruction set (task context) $z$ from the example pairs. All inputs are processed by a Transformer Decoder Stack Vaswani et al. (2017) with parameters $\psi$, which self-attends to all inputs and cross-attends to $z$ in order to produce the output predictions.

The whole model is then trained end-to-end via ADAM Kingma and Ba (2014) to minimize the cross-entropy loss between the query target and the decoded test output prediction.
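
As a rough sketch of this data flow (not the paper's exact implementation), the following Python/PyTorch snippet substitutes a plain LSTM for the DNC controller and uses a standard TransformerDecoder as the executor; all module sizes, the pairing of example inputs/outputs, and the output head are assumptions made for illustration.

import torch
import torch.nn as nn

class NARSketch(nn.Module):
    # Minimal sketch of the NAR data flow: a recurrent "context" module
    # (stand-in for the DNC) reads embedded example pairs and produces an
    # instruction set z; a Transformer decoder self-attends to the embedded
    # inputs and cross-attends to z to predict the outputs in latent space.
    def __init__(self, d_model=256, n_heads=16, n_layers=4):
        super().__init__()
        self.context_rnn = nn.LSTM(2 * d_model, d_model, num_layers=3, batch_first=True)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)
        self.executor = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, d_model)

    def forward(self, ex_in, ex_out, query_in):
        # ex_in, ex_out: (B, n_examples, d_model) latent grid embeddings
        # query_in:      (B, 1, d_model) latent embedding of the query grid
        pairs = torch.cat([ex_in, ex_out], dim=-1)    # relate each input to its output
        z, _ = self.context_rnn(pairs)                # "instruction set" / task context
        inputs = torch.cat([ex_in, query_in], dim=1)  # all inputs: examples + query
        decoded = self.executor(tgt=inputs, memory=z) # self-attend to inputs, cross-attend to z
        return self.head(decoded)                     # latent output predictions

model = NARSketch()
pred = model(torch.randn(2, 4, 256), torch.randn(2, 4, 256), torch.randn(2, 1, 256))
print(pred.shape)  # torch.Size([2, 5, 256])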

We employ a two-stage curriculum during training, first training only on a single tag, and then expanding to the whole training dataset. Additionally, during the first stage of training, spectral regularization Yoshida and Miyato (2017) with a larger coefficient is applied, which is then annealed in the second stage. When evaluating the model, we apply additional optimization steps (similar to Finn et al. (2017); Krause et al. (2018)), as described in Algorithm 1. See Appendix C for additional details.
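
A minimal sketch of the spectral regularization penalty in the spirit of Yoshida and Miyato (2017), estimating each weight matrix's top singular value with a few power-iteration steps; the coefficient name lam and the annealing comment are placeholders, not the paper's exact training code.

import torch
import torch.nn.functional as F

def spectral_penalty(model, n_power_iter=3):
    # Sum of squared spectral norms (top singular values) of all weight
    # matrices, each estimated with a few power-iteration steps.
    penalty = 0.0
    for w in model.parameters():
        if w.dim() < 2:
            continue                          # skip biases and other vectors
        mat = w.reshape(w.shape[0], -1)       # flatten conv kernels to 2-D
        u = F.normalize(torch.randn(mat.shape[0], device=w.device), dim=0)
        for _ in range(n_power_iter):
            v = F.normalize(mat.t() @ u, dim=0)
            u = F.normalize(mat @ v, dim=0)
        sigma = torch.dot(u, mat @ v)         # approximate top singular value
        penalty = penalty + sigma ** 2
    return penalty

# Usage sketch: loss = task_loss + lam * spectral_penalty(model),
# with lam annealed (divided by a constant factor) in the second curriculum stage.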

Motivation

We build on methods from Santoro et al. (2016) and use a memory-augmented neural network (the Differentiable Neural Computer Graves et al. (2016)) to derive context for the current task. The multiple read heads and attention mechanisms allow the DNC to relate the input and the output of a pair and compare them to the other input/output pairs that it has already processed. Unlike Santoro et al. (2016), we leverage a Transformer to carry out the task execution based on the DNC context. This decouples the learning of the instruction set from the program execution itself, and allows us to use the input/output relations directly, rather than a more standard sequential encoding of the examples. Lastly, the Transformer network relates the inputs to each other, thereby exploiting similarities within them.

Performance

As this work is still in progress, these are preliminary results evaluated on grids up to a fixed maximum size. Nonetheless, we outperform all currently known solutions, including a hand-crafted symbolic solver (see Fig. 1). Spectral regularization proved instrumental for this; other regularization methods did not yield any significant results (see Fig. 2).

Without any additional adaptation steps Finn et al. (2017), evaluation performance remains low at 1%, while after only 3 steps, that number climbs up to 78.8%. Analyzing more closely the network changes made by the adaptation steps, the gradient norm of the DNC parameters is negligible, which implies that the DNC is acting as a true meta-learner, and only the Transformer requires a small change to execute the instructions flawlessly. We again attribute this generalization to spectral regularization.
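
The per-module gradient norms described above can be monitored with a few lines; the parameter-group handles dnc and transformer below are placeholders for the corresponding sub-modules.

import torch

def grad_norm(params):
    # L2 norm of the concatenated gradients of a parameter group.
    grads = [p.grad.reshape(-1) for p in params if p.grad is not None]
    return torch.cat(grads).norm().item() if grads else 0.0

# After loss.backward() during an adaptation step:
# print(grad_norm(dnc.parameters()), grad_norm(transformer.parameters()))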

Given: evaluation dataset $\mathcal{D}_{eval}$; trained parameters $\theta$ (DNC), $\psi$ (Transformer)
Hyperparameters: step size $\alpha$; number of adaptation steps $k$
for each task $T = (\{(x_i, y_i)\}, (x_q, y_q))$ in $\mathcal{D}_{eval}$ do
       $\theta' \leftarrow \theta$, $\psi' \leftarrow \psi$
       for step in 1:k do
              Predict the example outputs from $\{x_i\}$ using $(\theta', \psi')$
              Evaluate the cross-entropy loss $\mathcal{L}$ on the example pairs $\{(x_i, y_i)\}$
              Adjust parameters: $\theta' \leftarrow \theta' - \alpha \nabla_{\theta'}\mathcal{L}$, $\psi' \leftarrow \psi' - \alpha \nabla_{\psi'}\mathcal{L}$
       end for
       Compute accuracy of the adapted model $(\theta', \psi')$ on the query pair $(x_q, y_q)$
end for
Algorithm 1 NAR evaluation cycle
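
A hedged sketch of how the evaluation cycle of Algorithm 1 might look in PyTorch: the trained model is copied per task, fine-tuned for k steps on the example pairs, and then scored on the query pair. The optimizer choice and the caller-supplied functions loss_fn and is_correct_fn are assumptions, not the paper's exact implementation.

import copy
import torch

def evaluate_nar(model, eval_tasks, loss_fn, is_correct_fn, k=3, step_size=1e-3):
    # Per-task adaptation on the example pairs, then evaluation on the
    # held-out query pair (a task counts as solved only if all pixels match).
    # loss_fn(model, examples) returns the cross-entropy on the example pairs;
    # is_correct_fn(model, query) returns True on an exact-match prediction.
    n_correct = 0
    for examples, query in eval_tasks:
        adapted = copy.deepcopy(model)                 # keep the trained weights intact
        opt = torch.optim.Adam(adapted.parameters(), lr=step_size)
        for _ in range(k):
            loss = loss_fn(adapted, examples)
            opt.zero_grad()
            loss.backward()
            opt.step()
        n_correct += int(is_correct_fn(adapted, query))
    return n_correct / len(eval_tasks)
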
Figure 1: We achieve 78.8% evaluation accuracy on the ARC dataset. The reported results are calculated for 100 unseen tasks from the corpus.

4 Effects of spectral regularization: stable ranks and complexity

The surprisingly significant effect of a simple spectral regularization strategy in the reasoning tasks suggests strong connections with generalization and the underlying model complexity estimates. On one hand, this motivates the analysis of spectral regularization in terms of some well-known generalization bounds (e.g. based on stable rank and spectral norms); however, we first discuss a perspective inspired by algorithmic information theory. Intuitively, abstract reasoning tasks are induced by a concise set of logic rules and combinatorial patterns, and, hence, it is natural to search for short programs producing these rules - in this regard, we give motivation as to why spectral regularization naturally shrinks the search space towards shorter programs.

Spectral regularization and polynomials as algorithmically simple programs. Classical methods from program inference and algorithmic information theory, such as Solomonoff inference and Occam’s razor Li and Vitanyi (2008), suggest that "simpler" program models are preferable in terms of forming abstract concepts and generalization - a formal approach towards such issues is given, e.g., by Kolmogorov complexity theory Li and Vitanyi (2008); Schmidhuber (1997). Although the evaluation of algorithmic complexity is a demanding task (Kolmogorov complexity is theoretically uncomputable), one could attempt to devise various proxy metrics that capture the algorithmic complexity of a given function/program.

Here, in an attempt to evaluate and explain the algorithmic complexity of our models from a spectral-regularization perspective, we consider approximations in terms of a simple but flexible class of programs computing rational-coefficient polynomials of maximal degree $d$. Intuitively, approximating a model $f$ by such a polynomial $p_d$ for lower values of $d$ corresponds to discovering programs of decreased algorithmic length (in terms of operations) that effectively compute $f$. In this direction, we bring forward some classical approximation theory results implying that lower spectral norms yield lower degrees of the approximation polynomial. To ease notation, here we discuss the 1-dimensional case; however, similar results hold for higher dimensions as well Trefethen (2013):

Proposition 1.

Let $f: [0, 1] \to \mathbb{R}$ represent a model with Lipschitz constant $L$. Then, for every degree $d$ there exists a polynomial $p_d$ of degree $d$, so that $\|f - p_d\|_\infty \le \frac{3L}{2\sqrt{d}}$, where $\|\cdot\|_\infty$ denotes the usual sup-norm over the interval $[0, 1]$.

Since the Lipschitz constant $L$ of a neural model is bounded above by the product of the spectral norms of the layers, $L \le \prod_i \|A_i\|_2$, spectral regularization gives control over $L$; moreover, a lower value of $L$ implies that one can decrease the polynomial degree $d$ and retain similar approximation quality. These observations support our empirical results - introducing spectral regularization steers the model search space towards algorithmically simpler and more robust functions.
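
For intuition, the Lipschitz upper bound used here can be checked numerically on a toy network: for feed-forward layers with 1-Lipschitz activations (tanh, ReLU), the product of the layers' spectral norms bounds the finite-difference slope of the network. The architecture below is an arbitrary illustration, not one of our models.

import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(16, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))

# Upper bound L <= prod_i ||A_i||_2 for 1-Lipschitz activations.
with torch.no_grad():
    lipschitz_bound = 1.0
    for m in mlp:
        if isinstance(m, nn.Linear):
            lipschitz_bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()

# Empirical check: finite-difference slopes never exceed the bound.
x = torch.randn(1000, 16)
dx = 1e-3 * torch.randn_like(x)
slope = (mlp(x + dx) - mlp(x)).norm(dim=1) / dx.norm(dim=1)
print(lipschitz_bound, slope.max().item())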

Generalization via spectral norms and stable ranks. We recall that the stable rank of a matrix $A$, $\mathrm{srank}(A)$, is defined as the ratio $\|A\|_F^2 / \|A\|_2^2$, and note that $\mathrm{srank}(A)$ is at most the rank of $A$. The stable rank is intuitively understood as a continuous proxy for the rank and as a measure of the true parameter count of $A$. Now, let $f$ be a deep neural model consisting of $k$ layers whose corresponding weight matrices are denoted by $A_1, \dots, A_k$. Recent works (e.g. Neyshabur et al. (2017); Arora et al. (2018)) obtain generalization bounds on $f$, roughly speaking, in terms of the expression $\prod_{i=1}^{k} \|A_i\|_2^2 \cdot \sum_{i=1}^{k} \mathrm{srank}(A_i)$, where $\|A_i\|_2$ denotes the spectral norm of the matrix $A_i$. A related stronger compression-based estimate in terms of so-called noise-cushions is obtained in Arora et al. (2018). In other words, the generalization error is influenced by the spectral norms as well as the stable ranks of the layers.
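
Stable ranks are cheap to monitor during training; a minimal sketch of the quantity tracked in Fig. 3 (logging per-layer values is our illustration, not the paper's exact instrumentation):

import torch

def stable_rank(weight: torch.Tensor) -> float:
    # srank(A) = ||A||_F^2 / ||A||_2^2, always at most rank(A).
    mat = weight.reshape(weight.shape[0], -1)          # flatten conv kernels to 2-D
    fro = torch.linalg.matrix_norm(mat, ord='fro')
    spec = torch.linalg.matrix_norm(mat, ord=2)
    return (fro ** 2 / spec ** 2).item()

def log_stable_ranks(model: torch.nn.Module):
    # Stable rank of every weight matrix in the model, keyed by parameter name.
    return {name: stable_rank(p) for name, p in model.named_parameters() if p.dim() >= 2}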

In this direction, we evaluated our model and the effect of stable ranks. Interpreting Fig. 3, one observes that initially, while the model adapts to the single task of pattern_expansion, it increases and stabilizes its effective parameter count (the stable ranks); afterwards, the model is introduced to the full task bundle, where a significant decrease of the stable ranks is observed - according to the expression above, this leads to better generalization, and further implies that at the end of training one actually deals with simpler models with better compression properties.

Figure 2: Performance on the pattern expansion task from the ARC dataset. The exact same models (DNC + Transformer) were trained with and without spectral regularization.
Figure 3: Stable ranks vary drastically depending on the task distribution. With a larger pool of tasks, the network's stable ranks decrease rapidly, as it is optimized for greater generalization, as opposed to specialization to a particular task.

5 Conclusion and Acknowledgements

We have demonstrated that spectral regularization provides neural learners with a significant boost in performance on abstract reasoning tasks. We believe that studying the complexity of the underlying models in the context of powerful frameworks such as Kolmogorov complexity or Solomonoff's theory of inductive inference is a promising step towards closing the neuro-symbolic gap. We would like to thank Dimitar Vasilev (Microsoft Inc.) for the computational resources used in this work.

References

  • S. Arora, R. Ge, B. Neyshabur, and Y. Zhang (2018) Stronger generalization bounds for deep nets via a compression approach. In 35th International Conference on Machine Learning (ICML 2018), Vol. 1, pp. 390–418.
  • P. L. Bartlett, D. J. Foster, and M. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems.
  • L. Blier and Y. Ollivier (2018) The description length of deep learning models. In Advances in Neural Information Processing Systems 31, pp. 2216–2226.
  • [4] D. Bonin. Task tagging. Kaggle public notebook. Accessed on 08.10.2020.
  • F. Chollet (2019) On the measure of intelligence. arXiv preprint arXiv:1911.01547.
  • K. Ellis, C. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Morales, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum (2020) DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. arXiv preprint arXiv:2006.08381.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
  • A. L. Gaunt, M. Brockschmidt, N. Kushman, and D. Tarlow (2017) Differentiable programs with neural libraries. In Proceedings of the 34th International Conference on Machine Learning, pp. 1213–1222.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476.
  • Y. Jiang*, B. Neyshabur*, H. Mobahi, D. Krishnan, and S. Bengio (2020) Fantastic generalization measures and where to find them. In International Conference on Learning Representations.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • B. Krause, E. Kahembwe, I. Murray, and S. Renals (2018) Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pp. 2766–2775.
  • M. Li and P. M. B. Vitanyi (2008) An Introduction to Kolmogorov Complexity and Its Applications. 3rd edition, Springer.
  • A. Linhares (2000) A glimpse at the metaphysics of Bongard problems. Artificial Intelligence 121 (1–2), pp. 251–270.
  • J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2018) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations.
  • G. Marcus (2018) Deep learning: a critical appraisal. arXiv preprint arXiv:1801.00631.
  • B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring generalization in deep learning. In Advances in Neural Information Processing Systems.
  • S. Penkov and S. Ramamoorthy (2018) Learning programmatically structured representations with perceptor gradients. In International Conference on Learning Representations.
  • [19] (2006) Polynomial approximation theory. In Spectral Methods: Fundamentals in Single Domains, Springer.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850.
  • A. Sanyal, P. H. Torr, and P. K. Dokania (2020) Stable rank normalization for improved generalization in neural networks and GANs. In International Conference on Learning Representations.
  • J. Schmidhuber (1997) Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks 10 (5), pp. 857–873.
  • T. Suzuki, H. Abe, and T. Nishimura (2020) Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • L. Trefethen (2017) Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society 145, pp. 4837–4844.
  • L. N. Trefethen (2013) Approximation Theory and Approximation Practice. SIAM.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri (2018) Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054.
  • C. Wei and T. Ma (2019) Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Advances in Neural Information Processing Systems 32, pp. 9725–9736.
  • Y. Yoshida and T. Miyato (2017) Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941.

Appendix A Abstraction and Reasoning Corpus

The Abstraction and Reasoning Corpus (ARC) [Chollet, 2019] is a dataset of grid-based pattern recognition and pattern manipulation tasks. A decision-making agent sees a small number of examples of input and output grids that illustrate the underlying logical relationship between them. It then has to infer this logical rule and apply it correctly on a test query.

In many ways, the benchmark is similar to the Bongard problems (see [Linhares, 2000]) – relations are highly abstract and geometric. Moreover, only 3-5 examples are presented for each task; therefore, the benchmark tests the ability of a decision-making agent to (i) grasp abstract logic and (ii) adapt quickly to new tasks.

There are 400 training and 400 evaluation task examples, structured as follows:

  • each task consists of a train and a test set;

  • the train set includes 3-5 input/output pairs;

  • the test set includes 1 input/output pair;

  • an input/output pair is comprised of an input grid and an output grid, the relation between which follows a consistent logic throughout the task;

  • grids are rectangular and are divided into squares;

  • grid patterns are drawn with 10 colors;

  • grid sizes vary between 1 and 30 in length and width; input and output grid sizes are not necessarily equal.

No set of rules exists that can solve all tasks, and while some skills are useful for multiple of them, each task has its own unique principle. This makes trivial approaches like brute-force computation impractical.

If a human were approaching these tasks, they would easily be able to spot the logical relations – we have developed the necessary priors to find similarities and infer logic. Therefore, ideally the neural network would derive these priors during training, allowing it to generalize well to the evaluation dataset.

For the current scope of our research we use all tasks with grids not larger than a fixed maximum size. For the training set, we augment the tasks to 15000 by permuting colors and by exploiting the fact that the tasks are invariant under rotation and reflection.
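
A minimal sketch of this augmentation, assuming the public ARC JSON format ("train"/"test" lists of input/output grids with color indices 0-9); whether the background color is kept fixed during recoloring is an implementation choice we leave open.

import json
import random
import numpy as np

def load_task(path):
    # An ARC task file contains "train" and "test" lists of
    # {"input": grid, "output": grid} pairs.
    with open(path) as f:
        return json.load(f)

def augment_task(task, rng=random):
    # One augmented copy of a task: a random permutation of the 10 colors
    # plus a random rotation/reflection, applied consistently to every grid.
    perm = np.random.permutation(10)
    k = rng.randrange(4)                   # number of 90-degree rotations
    flip = rng.random() < 0.5

    def transform(grid):
        g = perm[np.array(grid)]           # recolor
        g = np.rot90(g, k)                 # rotate
        if flip:
            g = np.fliplr(g)               # reflect
        return g.tolist()

    return {split: [{"input": transform(p["input"]), "output": transform(p["output"])}
                    for p in task[split]]
            for split in ("train", "test")}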

Figure 4: Structure of the Abstraction and Reasoning Corpus dataset.

Appendix B Grid Embedding

Prior to embedding, we zero-pad all grids to a fixed spatial size, with the original grid in the center of the image. Additionally, colors in the grids are represented as one-hot vectors over the 10 colors, which fixes the channel dimension of the grids.
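
A minimal sketch of this preprocessing; the default padded size below is a placeholder parameter, not the paper's exact resolution.

import torch
import torch.nn.functional as F

def grid_to_tensor(grid, pad_to=30, n_colors=10):
    # One-hot encode a variable-size ARC grid and zero-pad it to a fixed
    # spatial size, keeping the original grid in the center.
    g = torch.tensor(grid, dtype=torch.long)                    # (H, W)
    one_hot = F.one_hot(g, n_colors).permute(2, 0, 1).float()   # (10, H, W)
    h, w = g.shape
    top = (pad_to - h) // 2
    left = (pad_to - w) // 2
    out = torch.zeros(n_colors, pad_to, pad_to)
    out[:, top:top + h, left:left + w] = one_hot
    return out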

The embedding is done with a convolutional neural network comprised of an encoder and a decoder. The encoder consists of a basket of 10 convolutions with different filter sizes (the module shown in Fig. 5). Different filter sizes enable the network to capture both local and global patterns. The convolution outputs are flattened and passed to linear layers with hyperbolic tangent activation functions, which transform them into the desired latent dimensionality. A second neural network computes weights for summing the 10 resulting vectors. The decoder shares the same architecture with the encoder, but in reverse order – first linear layers, after that convolutions and then a weighted sum; finally a softmax over the color dimension.

Summing the convolutional outputs enables the embedding network to be agnostic to the order in which it receives them (unlike an RNN, for instance, which would be order-dependent). The weights reflect the fact that grid sizes vary and therefore not all filter sizes are equally applicable or useful.
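
A simplified sketch of this weighted-sum-of-branches idea; the kernel sizes, channel counts, and the exact form of the gating network are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class MultiKernelEncoder(nn.Module):
    # Several conv branches with different kernel sizes, each flattened and
    # projected to the latent size with tanh; a small gating network produces
    # weights for summing the branch embeddings into one latent vector.
    def __init__(self, in_ch=10, grid=30, latent=256, kernels=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernels:
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=k, padding=k // 2),
                nn.Flatten(),
                nn.Linear(16 * grid * grid, latent),
                nn.Tanh(),
            ))
        self.gate = nn.Sequential(nn.Linear(latent * len(kernels), len(kernels)),
                                  nn.Softmax(dim=-1))

    def forward(self, x):                                          # x: (B, 10, grid, grid)
        embs = torch.stack([b(x) for b in self.branches], dim=1)   # (B, n_branches, latent)
        w = self.gate(embs.flatten(1)).unsqueeze(-1)               # (B, n_branches, 1)
        return (w * embs).sum(dim=1)                               # (B, latent)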

While training the Embedding network, we found that spectral regularization again proved to be instrumental for achieving 95% perfect reconstruction accuracy. In contrast, networks regularized by weight decay failed to climb above 6%.

What is more, tanh activations yielded a network that is 3 orders of magnitude more stable to Gaussian input noise than the same network, trained under the same conditions, with ReLU activations. We postulate that this is due to the fact that ReLU is unbounded for positive inputs, and therefore random perturbations have a greater impact.

Figure 5: Structure of the Embedding network.

Appendix C Architecture Details

Figures 6 and 7 give greater detail about the architecture we devise, as well as the procedure for training it end-to-end.

Figure 6: Order of procedures in training the model. Total runtime on a single K80 GPU is on the order of hours.
Figure 7: We use a memory-augmented neural network (DNC) to extract the context of the current task from the example input-output pairs. The final DNC output is passed to the cross-attention layers of a Transformer Decoder, which processes all inputs (examples and query) and produces output predictions. The loss is calculated as the per-pixel cross-entropy between the output predictions and the targets, and the model is trained end-to-end.

Appendix D Additional Results

Figure 8: When we observed that stable ranks, a proxy for the number of parameters, were negatively correlated with generalization performance, we analyzed whether an explicitly smaller model would improve performance without spectral regularization. While better performance and greater variance are observed, the model with spectral regularization remains unmatched.

Appendix E Hyperparameters

Hyperparameter | Value | Description
learning rate | 0.001 | Learning rate of the models
regularization coefficient | 6e-4 | Coefficient of the spectral regularization penalty in the loss
annealing factor | 10 | Factor by which the regularization coefficient is divided to weaken regularization
Embedding network hyperparameters
latent dimension | 256 | Dimension of the embedded grids
kernel sizes | | Sizes of the kernels used in the convolutional modules
# of kernels | 128 | Number of convolutional kernels
DNC hyperparameters (temporal linkage is disabled)
# of read heads | 6 | Number of attention heads in the reading mechanism
# of LSTM layers | 3 | Layers of the LSTM controller
LSTM hidden size | 512 | Hidden size of the LSTM
memory dimensions | 32 × 64 | 32 memory locations with word length 64
Transformer hyperparameters
# of decoders | 4 | Number of networks in the decoder stack
# of attention heads | 16 | Number of heads in the multi-head attention mechanism
Transformer hidden size | 256 | Internal hidden size of the linear layers in the Transformer

Appendix F Approximation via polynomial programs

In this section we briefly elaborate on the polynomial approximation mentioned in the main text. In general, classical results in this direction are based, e.g., on Chebyshev, Legendre and Bernstein polynomial approximations. Here we discuss a 1-dimensional approximation via Bernstein polynomials, but similar estimates hold in higher dimensions as well as for other polynomial schemes (e.g. Chebyshev). We refer to [19], Trefethen (2013) for a thorough collection of results as well as further references.

A well-known method of approximating one-dimensional continuous functions is by means of Bernstein polynomials. Let $f: [0, 1] \to \mathbb{R}$ be continuous. The Bernstein polynomial of order $n$ corresponding to $f$ is defined as:

$$B_n(f)(x) = \sum_{k=0}^{n} f\left(\tfrac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n - k}. \quad (1)$$

Theorem 1.

The following estimate holds:

$$\|f - B_n(f)\|_\infty \le \tfrac{3}{2}\, \omega_f\!\left(n^{-1/2}\right), \quad (2)$$

where $\omega_f$ denotes the modulus of continuity defined as

$$\omega_f(\delta) = \sup_{|x - y| \le \delta} |f(x) - f(y)|. \quad (3)$$

In other words, $\omega_f(\delta)$ measures the maximal jump of $f$ over points which are no more than a distance $\delta$ apart.

Note that whenever $f$ has Lipschitz constant $L$, the modulus of continuity is controlled from above via $\omega_f(\delta) \le L\delta$. Applying this bound in the estimate of Theorem 1 yields the result mentioned in the text. As already mentioned, more technical higher-dimensional analogues of the estimates are also available (Trefethen (2013), [19]) - e.g. an analogous result holds for Chebyshev polynomials, where the modulus of continuity is appropriately replaced by extrema over complex ellipsoids Trefethen (2017).
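
A small numerical illustration of the Bernstein construction and the resulting $\tfrac{3}{2} L / \sqrt{n}$ bound; the test function below is arbitrary and chosen only because its Lipschitz constant is easy to read off.

import numpy as np
from math import comb

def bernstein(f, n, x):
    # B_n(f)(x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)
    k = np.arange(n + 1)
    coeffs = np.array([comb(n, int(j)) for j in k], dtype=float)
    return sum(f(j / n) * c * x**j * (1 - x)**(n - j) for j, c in zip(k, coeffs))

f = lambda t: np.abs(t - 0.5)      # Lipschitz with L = 1
L = 1.0
x = np.linspace(0, 1, 1001)
for n in (4, 16, 64, 256):
    err = np.max(np.abs(f(x) - bernstein(f, n, x)))
    print(n, err, 1.5 * L / np.sqrt(n))   # observed error vs. the 3/2 * L / sqrt(n) bound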