Seeing Colors: Learning Semantic Text Encoding for Classification

by   Shah Nawaz, et al.

The question we answer in this work is: can we convert a text document into an image in order to exploit the best image classification models for document classification? To answer it, we present a novel text classification method that converts a text document into an encoded image using word embeddings, exploiting the capabilities of Convolutional Neural Networks (CNNs), which have been so successful in image classification. We evaluate our approach on several well-known benchmark datasets for text classification, obtaining promising results. This work enables the application of many advanced CNN architectures developed for Computer Vision to Natural Language Processing. We also test the proposed approach on a multi-modal dataset, showing that a single deep model can represent text and image in the same feature space.




1 Introduction

Text classification is a common task in Natural Language Processing. Its goal is to assign a label to a text document from a predefined set of classes. Researchers have extensively focused on designing the best features along with the choice of the best possible machine learning classifier. Traditional text classification methods, such as n-grams, captured the linguistic structure from a statistical point of view; recent methods, however, successfully employ deep learning techniques and CNNs for text classification kimconvolutional ; zhang2015character . Image classification has a similar aim, and CNNs have recently become the de-facto standard in that field krizhevsky2012imagenet ; szegedy2015going .

Word embedding models Mikolov2013nips ; pennington2014glove ; le2014distributed convert words or sentences into vectors of real numbers. Typically these models are trained on a large corpus of text documents to capture semantic relationships among words; thus, they produce similar embeddings for words occurring in similar contexts. We exploit this fundamental property of word embedding models to transform a text document into a sequence of colors, obtaining an artificial (encoded) image, as shown in Figure 1. Intuitively, semantically related words obtain similar colors, or encodings, in the encoded image, while uncorrelated words are represented with different colors.

We present a novel text classification approach that casts text documents into the visual domain. Our approach transforms text documents into encoded images capitalizing on word embeddings. We evaluate the method on several large-scale datasets, obtaining promising and, in some cases, state-of-the-art results. In summary, the main contributions of our paper are:
– We present an approach to transform text documents into encoded images based on Word2Vec word embedding.
– A real-world multi-modal application based on the proposed approach that uses a single model to manage joint representations of image and encoded text.

A previous version of our encoding scheme was published in ICDAR 2017 gallosemantic2017 . In this work, we explore the various parameters associated with the encoding scheme and extensively evaluate the improved scheme on benchmark datasets. Furthermore, we present an application scenario that uses our encoding scheme to fuse encoded text and images for multi-modal classification.

Figure 3: Starting from an encoded text document, the resulting image is classified by a CNN model normally employed in image classification. The first convolutional layers look for particular features of visual words, while the remaining convolutional layers recognize sentences and increasingly complex concepts.

2 Related Work

Much of the work applying deep learning methods to text documents involves learning word vector representations through neural language models Mikolov2013nips ; pennington2014glove . This vector representation serves as the foundation of our work, where word vectors are transformed into sequences of colors. Tang et al. tang2014learning adopted an ad-hoc model using three neural networks to encode sentiment information from text; they concluded that the word embedding models presented in Mikolov2013nips and collobert2011natural are not effective for sentiment classification.

Figure 4: A graphical representation of our encoding scheme. The figure shows an encoded image in which each visual word of a given width is composed of square superpixels.

There are many works in the literature, such as zheng2017dual ; wehrmann2018order ; wang2016learning ; wang2017learning , that deal with both images and texts to achieve semantic alignment. However, they need to employ different deep models, making their approaches computationally intensive, as reinforced by nawaz2018revisiting . Moreover, they have to design ad-hoc loss functions to achieve this result. With our proposed method, a single deep neural network suffices to solve a similar task. Zhang et al. zhang2015character treated text as a raw signal at the character level and employed a deep architecture to perform text classification. Similarly, Conneau et al. conneau2017very presented a deep architecture that operates at the character level with convolutional layers to learn hierarchical representations of text, inspired by recent progress in computer vision simonyan2014very ; he2016deep . In our work, we also leverage recent successes in computer vision, but instead of adapting a deep neural network to be fed with text information, we propose an approach that converts text documents into encoded images, to which state-of-the-art deep architectures can then be applied directly. We compared our proposed approach with deep learning models based on word embeddings and lookup tables, along with the methods proposed in zhang2015character and conneau2017very , on the same datasets. The experimental results in Section 5 show that in some cases our approach surpasses the state of the art, while in others it obtains comparable results.

Figure 5: An example of five feature maps (conv1, …, conv5) displayed over an input image from the DBPedia dataset. In this particular example, images are encoded with Word2Vec features and blank space between visual words. The map generated by the conv1 layer shows activations of individual superpixels or sequences of superpixels, while the other convolutional layers show larger activation areas spanning more visual words.

3 Proposed Approach

3.1 Encoding Scheme

The proposed encoding approach is based on the Word2Vec word embedding Mikolov2013nips . We encode each word of a document into a small region of an artificial image. The approach uses a dictionary that associates each word with a feature vector obtained from a trained Word2Vec model. Given a word, we obtain a visual word of a given width that contains a subset of the feature vector components, rendered as superpixels (see example in Fig. 2). A superpixel is a square area of uniform color that represents a sequence of contiguous features extracted as a sub-vector of the word embedding. A graphical representation is shown in Fig. 4. We normalize each component to assume values in the interval [0, 255] and then interpret consecutive triplets of features as RGB values. For this reason, we use feature vectors whose length is a multiple of 3.
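To make the scheme concrete, the following is a minimal sketch of how a single word embedding could be rendered as a visual word. The function name, the default sizes (4×4-pixel superpixels, 4 superpixels per row) and the min-max normalization are our illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def encode_word(vec, superpixel=4, word_width=4):
    """Render one word-embedding vector as a grid of RGB superpixels.

    `vec` must have a length divisible by 3; each consecutive (r, g, b)
    triplet becomes one square superpixel of side `superpixel` pixels,
    laid out row-major with `word_width` superpixels per row.
    """
    vec = np.asarray(vec, dtype=np.float32)
    assert vec.size % 3 == 0, "feature length must be a multiple of 3"
    # Normalize features to [0, 255] so triplets can be read as RGB.
    lo, hi = vec.min(), vec.max()
    norm = np.zeros_like(vec) if hi == lo else (vec - lo) / (hi - lo) * 255
    triplets = norm.reshape(-1, 3).astype(np.uint8)  # one row per superpixel
    rows = int(np.ceil(len(triplets) / word_width))
    img = np.zeros((rows * superpixel, word_width * superpixel, 3), np.uint8)
    for i, rgb in enumerate(triplets):
        r, c = divmod(i, word_width)
        img[r*superpixel:(r+1)*superpixel, c*superpixel:(c+1)*superpixel] = rgb
    return img

word_vec = np.random.RandomState(0).randn(36)  # e.g. 36 Word2Vec features
visual_word = encode_word(word_vec)            # 12 triplets -> 3 rows of 4
```

A full document encoder would place such visual words on an image canvas, separated by blank space, in reading order.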

The blank space around each visual word plays an important role in the encoding approach. We found that this space parameter is directly related to the shape of a visual word: its value must be close to the superpixel size to let the network understand where one word ends and another begins. In its first convolutional layer, a CNN mainly extracts features of single superpixels (see the conv1 example in Fig. 5), so it is important to be able to tell whether two activated superpixels belong to the same or to different visual words.

Each convolutional layer produces feature maps from the input to the last layer, as shown in Fig. 3 and Fig. 5. We noticed that the first convolutional layer recognizes particular features of visual words associated with single or multiple superpixels. The remaining layers, instead, aggregate these simple activations to build increasingly complex relationships between words or parts of sentences in a text document, much as in image classification tasks.

3.2 CNN Configuration

We encode the text document in an image to exploit the power of CNNs typically used in image classification. CNNs usually employ the “mirror” data augmentation technique to obtain robust models in image classification. In this work, this augmentation is disabled, because mirroring an encoded text image changes the semantics of the text document and thus hampers the result. Another way to increase the training data in image classification is to “crop” a certain number of patches from each training image and use them as input to the CNN. We adopted this process, and it slightly improved the results of our approach.

The “stride” parameter plays a key role in reducing the computational complexity of the network. However, its value must not exceed the superpixel size: larger values skip too many pixels, losing information during the convolution and invalidating the results.

4 Datasets

In this section we provide an overview of several large-scale datasets introduced in zhang2015character , which cover classification tasks such as sentiment analysis, topic classification and news categorization. In addition, we used the 20news-bydate dataset to test the different parameters associated with the encoding approach. Table 1 reports summary statistics of the datasets.

20news-bydate. We used the 20news-bydate version of the 20news dataset available on the web. We selected the four major categories: comp, politics, rec and religion, as described in hingmire2013document . These categories contain 7,977 training and 7,321 test samples.

AG’s news corpus. We used the version created by Zhang et al., who selected the four largest classes from the AG news corpus available on the web zhang2015character . The dataset contains 30,000 training and 1,900 test instances per class, with each instance containing class index, title and description fields.

Sogou news corpus. Zhang et al. proposed a revised version of this dataset by selecting the top five categories from the SogouCA and SogouCS news corpora, with 90,000 training and 12,000 test instances per class, each containing title and content fields zhang2015character . This dataset is in Chinese; Zhang et al. used the pypinyin package combined with the jieba Chinese segmentation system to produce pinyin, a phonetic romanization of the Chinese characters, so that the same models used for English could be employed without changes zhang2015character .

DBPedia ontology dataset. This dataset was created by selecting 14 non-overlapping classes from DBPedia 2014. In total, 560,000 training and 70,000 testing samples were randomly selected from these classes, with each instance containing the title and abstract of a Wikipedia article.

Yelp reviews. Zhang et al. used the Yelp Dataset Challenge 2015 dataset to perform two classification tasks zhang2015character . The first task predicts the number of stars assigned by the user; the other predicts the polarity label obtained by considering stars 1 and 2 as negative and stars 3 and 4 as positive. These two tasks led to two datasets: the full dataset has 130,000 training and 10,000 testing samples per star, while the polarity dataset has 280,000 training and 19,000 test samples per polarity.

Yahoo! Answers dataset. Zhang et al. obtained the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset through the Yahoo! Webscope program zhang2015character . The ten largest classes were selected from this corpus to create a topic classification dataset, with each instance containing question title, question content and best answer. Each class contains 140,000 training and 6,000 testing samples.

Amazon reviews. Zhang et al. used the Amazon review dataset from the Stanford Network Analysis Project (SNAP) to create full score prediction and polarity prediction datasets zhang2015character . The full dataset contains 600,000 training and 130,000 testing samples per class, whereas the polarity dataset contains 1,800,000 training and 200,000 testing samples per class.

Dataset                 Classes  Training   Test     Avg  Std
20news-bydate           4        7,977      7,321    339  853
AG's News               4        120,000    7,600    43   13
Sogou News              5        450,000    60,000   47   73
DBPedia                 14       560,000    70,000   54   25
Yelp Review Polarity    2        560,000    38,000   138  128
Yelp Review Full        5        650,000    50,000   140  127
Yahoo! Answers          10       1,400,000  60,000   97   105
Amazon Review Full      5        3,000,000  650,000  91   49
Amazon Review Polarity  2        3,600,000  400,000  90   49
Table 1: Statistics of 20news-bydate and the large-scale datasets presented in Zhang et al. zhang2015character . The rightmost columns show the average (Avg) and standard deviation (Std) of the number of words per text document.

Figure 6: Classification error using data augmentation (mirror and crop) over the 20news-bydate test set.
Figure 7: Five different sizes of encoded image obtained from the same document belonging to the 20news-bydate dataset. All images use the same encoding parameters (Word2Vec feature length, space and superpixel size). It is important to note that the two leftmost images cannot represent all the words in the document due to their small size.
Figure 8: Five encoded images obtained using different Word2Vec feature lengths for the same document belonging to the 20news-bydate dataset. All the images are encoded using the same space, superpixel size, image size and visual word width. The two leftmost images, encoded with shorter Word2Vec feature vectors, contain all the words in the document, while the rightmost images, with longer feature vectors, cannot encode the entire document.
image size  crop  error
—           443   8.63
400x400     354   9.30
300x300     266   10.12
200x200     177   10.46
100x100     88    15.70

sp   error
5x5  8.96
4x4  8.87
3x3  10.27
2x2  10.82
1x1  10.89

stride  error
5       8.7
4       8.87
3       8.33
2       7.78
1       12.5

w2v  Mw   #w   error
12   180  50%  9.32
24   140  64%  8.87
36   120  71%  7.20
48   100  79%  8.21
60   90   83%  20.66
Table 2: Comparison of different parameters over the 20news-bydate dataset. In the leftmost table we changed the size of the encoded image, and the crop size changed accordingly, obtained by multiplying the image size by a constant factor. Here sp stands for superpixel size, w2v for the number of Word2Vec features, Mw for the maximum number of visual words that an image can contain, and #w for the fraction of test documents having more words than Mw. The remaining non-specified parameters were kept fixed.
Figure 9: On the left, five different designs for visual words represented by Word2Vec features, over the 20news-bydate dataset. The first four visual words consist of superpixels arranged in different shapes. On the right, a comparison of these different shapes of visual words.
space  V  H  w2v feat.  error (%)
4      4  1  12         7.63
8      4  1  12         5.93
12     4  1  12         4.45
16     4  1  12         4.83
4      4  2  24         6.94
8      4  2  24         5.60
12     4  2  24         5.15
16     4  2  24         4.75
4      4  3  36         6.72
8      4  3  36         5.30
12     4  3  36         4.40
16     4  3  36         4.77
Table 3: Comparison between CNNs trained with different configurations of our proposed approach. The width V (in superpixels) of the visual words is fixed while the Word2Vec vector size and the space (in pixels) between visual words vary; H is the resulting height (in superpixels) of each visual word.

5 Experiments

The aim of these experiments is threefold: (i) to evaluate configuration parameters associated with the encoding approach; (ii) to compare the proposed approach with other deep learning methods; (iii) to validate the proposed approach on a real-world application scenario.

In our experiments, percentage error is used to measure classification performance. The encoding approach described in Section 3.1 produces encoded images based on the Word2Vec word embedding; these encoded images can then be used to train and test a CNN. We used the AlexNet krizhevsky2012imagenet and GoogleNet szegedy2015going architectures as base models. We used a publicly available Word2Vec implementation with default configuration parameters Mikolov2013nips to train word vectors for each dataset. Normally, Word2Vec is trained on a large external corpus and reused in different contexts; in our work, however, we trained the model on the training set of each dataset.

5.1 Parameters setting

We used the 20news-bydate dataset to perform a series of experiments with different settings to find the best configuration for the proposed method. In our first experiment, we changed the space between visual words and the number of Word2Vec features to identify relationships between these parameters. The results are shown in Table 3; we used the best configuration from this table to perform the other experiments shown in Table 4. We obtained a lower percentage error with larger values of the space parameter and a higher number of Word2Vec features. The appropriate feature vector length depends on the nature of the dataset: as Fig. 8 shows, a text document composed of a large number of words cannot be encoded completely with a high number of Word2Vec features, because each visual word then occupies more space in the encoded image. Furthermore, we found that the error does not decrease linearly with the number of Word2Vec features, as shown in Table 2.

We tested various shapes for visual words before selecting the best one, as shown in Fig. 9 (left). With rectangular visual words we obtained better results, as highlighted in Fig. 9 (right). Moreover, the space between visual words plays an important role in the final classification: with a sufficiently large space, the convolutional layers can effectively distinguish between visual words, as the results in Table 3 also demonstrate. As shown in Fig. 5, the first layer of a CNN (conv1) specializes its convolution filters in recognizing single superpixels; hence, increasing the space parameter helps the network distinguish superpixels belonging to different visual words.

These experiments led us to conclude that there is a trade-off between the number of Word2Vec features used to encode each word and the number of words that can be represented in an image: increasing the number of features increases the space required to represent a single word, which reduces the maximum number of words that fit in an image. This parameter must therefore be chosen considering whether the dataset is characterized by short or long text documents. For our experiments, we used 36 Word2Vec features, considering the results presented in Table 2.
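The trade-off above can be illustrated with a small capacity estimate. The layout arithmetic below (words packed on a regular grid, each followed by blank space) is a simplifying assumption rather than the paper's exact geometry, but it reproduces the qualitative trend: longer feature vectors mean fewer words per image.

```python
import math

def max_visual_words(image_size=256, superpixel=4, word_width=4,
                     features=36, space=12):
    """Estimate how many visual words fit in a square encoded image.

    Each word spans `word_width` superpixels horizontally and
    ceil((features / 3) / word_width) superpixels vertically (one
    superpixel per RGB triplet), separated by `space` blank pixels.
    """
    rows_per_word = math.ceil((features // 3) / word_width)
    word_w = word_width * superpixel + space     # pixels per word, horizontally
    word_h = rows_per_word * superpixel + space  # pixels per word, vertically
    return (image_size // word_w) * (image_size // word_h)

# Longer embeddings leave room for fewer words in the same image.
capacities = {f: max_visual_words(features=f) for f in (12, 24, 36, 48, 60)}
```

Under these assumptions the capacity decreases monotonically as the Word2Vec feature length grows, mirroring the Mw column of Table 2 in trend (not in exact values).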

5.2 Data augmentation

We showed that the mirror data augmentation technique, successfully used in image classification, is not recommended here, because it changes the semantics of the encoded words and can deteriorate classification performance; results are presented in Fig. 6. In addition, we showed that increasing the number of training samples with the crop parameter improves results. More precisely, during training, random crops are extracted from the encoded image (proportionally to the image size, as reported in the leftmost part of Table 2) and fed to the network. During testing, we extract a single patch from the center of the image.
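As a sketch, the train-time random crop and test-time center crop described above might look like the following; the crop size used in the example and the array layout are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def random_crop(img, size, rng=None):
    """Training-time augmentation: a random spatial crop of an encoded
    text image (mirroring is deliberately not used, since flipping
    would scramble the word order)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]

def center_crop(img, size):
    """Test-time: a single deterministic patch from the image center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

encoded = np.zeros((256, 256, 3), dtype=np.uint8)  # a dummy encoded image
train_patch = random_crop(encoded, 227, rng=np.random.default_rng(0))
test_patch = center_crop(encoded, 227)
```

Each training epoch thus sees a slightly shifted view of the document layout, while evaluation always uses the same central view.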

Model AG Sogou DBP. Yelp P. Yelp F. Yah. A. Amz. F. Amz. P.
Zhang et al. 7.64 2.81 1.31 4.36 37.95 28.80 40.43 4.93
Conneau et al. 8.67 3.18 1.29 4.28 35.28 26.57 37.00 4.28
Encoding scheme + AlexNet 9.19 8.02 1.36 11.55 49.00 25.00 43.75 3.12
Encoding scheme + GoogleNet 7.98 6.12 1.07 9.55 43.55 24.90 40.35 3.01
Table 4: Testing error of our encoding approach with AlexNet and GoogleNet on the large-scale datasets. Results obtained by Zhang et al. zhang2015character and Conneau et al. conneau2017very are included for comparison.

5.3 Encoded image size

We used various image sizes for the encoding approach. Fig. 7 shows artificial images built on top of Word2Vec features at different sizes. As illustrated in Table 2, the percentage error decreases as the size of the encoded image increases; however, larger sizes are computationally intensive, which led us to choose the image size typically used in the AlexNet and GoogleNet architectures.

5.4 Comparison with other state-of-the-art text classification methods

We compared our approach with state-of-the-art methods. Zhang et al. presented a detailed analysis of traditional and deep learning methods; we selected the best results from that paper and report them in Table 4. In addition, we compared our results with Conneau et al. We obtained state-of-the-art results on the DBPedia, Yahoo! Answers and Amazon Polarity datasets, and comparable results on the AG's News, Amazon Full and Yelp Full datasets. However, we obtained a higher error on the Sogou dataset, due to the pinyin conversion process explained in Section 4.

5.5 Comparison with state-of-the-art CNNs

We obtained better performance using GoogleNet, as expected. This led us to believe that more recent state-of-the-art architectures, such as Residual Networks, would further improve results. Working with huge datasets and powerful models requires high-end hardware and long training times, so we conducted these experiments only on the 20news-bydate dataset, with three network architectures: AlexNet, GoogleNet and ResNet. Results are shown in Table 5: performance improves with the more powerful architectures.

CNN architecture error
Encoding scheme + AlexNet 4.10
Encoding scheme + GoogleNet 3.81
Encoding scheme + ResNet 2.95
Table 5: Percentage errors on 20news-bydate dataset with three different CNNs.
Dataset Image Text Proposed
Amazon 53.93 35.93 27.48
Ferramenta 7.64 12.14 5.16

Table 6: Percentage error of the proposed approach compared with each single modality.

Figure 10: An example of multi-modal fusion from the Amazon dataset, class “Baby”. (a) shows the original image, (b) the encoded text image, and (c) the image with the encoded text superimposed on its upper part. The text in this example contains only the following four words: “Kidco Safeway white G2000”. All three images have the same size.

5.6 An Application Scenario

5.6.1 Classification Scenario

One of the main advantages of the proposed approach is the ability to convert a text document into the same feature space as an image. We therefore believe that the proposed approach can improve the solution of problems requiring the combined use of image and text. For example, an advertisement on an e-commerce website consists of an image and a text description, so it is useful to exploit both sources in a multi-modal strategy.

We therefore use two multi-modal datasets to demonstrate that our approach brings significant benefits. The first, the Ferramenta dataset gallo2017multimodal , consists of advertisements split into train and test sets. We also used another publicly available real-world dataset, Amazon Product Data he2016ups , from which we randomly selected adverts from a subset of classes and split the advertisements of each class into train and test sets.

Both datasets provide a text description and a representative image for each advertisement. Text descriptions in the Ferramenta dataset are in Italian, and we preferred not to translate them into English, because we believe the translation process would alter the nature of the descriptions; text descriptions in the Amazon Product Data dataset are in English. We compare the classification of advertisements using only the encoded text description, only the image, and their combination. A sample image for these experiments is shown in Fig. 10. On this example, the model trained on images only obtained the following top-two predictions: 77.42% Baby and 11.16% Home and Kitchen, while the model trained on the multi-modal Amazon dataset predicted 100% Baby and 0% Patio Lawn and Garden. This indicates that our encoding technique leads to better classification than using text or image alone. Table 6 shows that combining text and image into a single image outperforms the best single-modality result on both multi-modal datasets.
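The fusion illustrated in Fig. 10 can be sketched as a simple overlay: the encoded text image replaces the top rows of the product image, and the fused image is fed to an ordinary image CNN. The function and variable names below are illustrative, not the paper's implementation.

```python
import numpy as np

def fuse_text_and_image(product_img, encoded_text):
    """Superimpose the encoded text on the upper part of the product
    image so that a single CNN sees both modalities at once."""
    assert product_img.shape[1] >= encoded_text.shape[1], "text strip too wide"
    fused = product_img.copy()
    rows, cols = encoded_text.shape[:2]
    fused[:rows, :cols] = encoded_text
    return fused

image = np.full((256, 256, 3), 255, dtype=np.uint8)   # dummy product photo
text_strip = np.zeros((40, 256, 3), dtype=np.uint8)   # dummy encoded text
fused = fuse_text_and_image(image, text_strip)
```

Because the fused result is still just an image, no architectural change or ad-hoc loss is needed: the standard classification pipeline applies unchanged.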

5.6.2 Retrieval Scenario

The work in nawaz2018revisiting employs our encoding scheme in a cross-modal retrieval system. With our encoding scheme, the retrieval system uses a single network trained end-to-end. The results presented there verify that raw text features are not necessary to map text descriptions into an embedding space shared with their corresponding images.

6 Conclusion

In this work, we presented a novel text classification approach that transforms the Word2Vec features of text documents into artificial images, to exploit the capabilities of CNNs for text classification. We obtained state-of-the-art results on some datasets, while on others our approach obtained comparable results. We showed that CNN models generally used for image classification can be successfully employed for a different task such as text classification. As shown in the experimental section, the trend in the results clearly indicates that they can be further improved with more recent and powerful deep learning models for image classification.

7 Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation, which donated the GeForce GTX 980 GPU used in some experiments. We also acknowledge E4 Computer Engineering S.p.a. for providing an NVIDIA DGX-1, which allowed us to run experiments on computationally intensive datasets.


  • (1) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), 2493–2537 (2011)
  • (2) Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, vol. 1, pp. 1107–1116 (2017)
  • (3) Gallo, I., Calefati, A., Nawaz, S.: Multimodal classification fusion in real-world scenarios. In: Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 5, pp. 36–41. IEEE (2017)
  • (4) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  • (5) He, R., McAuley, J.: Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In: proceedings of the 25th international conference on world wide web, pp. 507–517. International World Wide Web Conferences Steering Committee (2016)
  • (6) Hingmire, S., Chougule, S., Palshikar, G.K., Chakraborti, S.: Document classification by topic labeling. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp. 877–880. ACM (2013)
  • (7) Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Machine Learning: ECML-98, pp. 137–142 (1998)
  • (8) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Association for Computational Linguistics (2014)
  • (9) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
  • (10) Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pp. 3111–3119. Curran Associates, Inc. (2013)
  • (11) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • (12) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9 (2015)
  • (13) Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for twitter sentiment classification. In: ACL (1), pp. 1555–1565 (2014)
  • (14) Wehrmann, J., Mattjie, A., Barros, R.C.: Order embeddings and character-level convolutions for multimodal alignment. Pattern Recognition Letters 102, 15–22 (2018)
  • (15) Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in neural information processing systems, pp. 649–657 (2015)
  • (16) Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Shen, Y.D.: Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535 (2017)