1 Introduction
Characteristic metrics are a set of unsupervised measures that quantitatively describe or summarize the properties of a data collection. These metrics generally do not use ground-truth labels and only measure the intrinsic characteristics of data. The most prominent example is descriptive statistics, which summarize a data collection by a group of unsupervised measures such as mean or median for central tendency, variance or minimum/maximum for dispersion, skewness for symmetry, and kurtosis for heavy-tailed analysis.
In recent years, text classification, a category of Natural Language Processing (NLP) tasks, has drawn much attention [Zhang et al.2015, Joulin et al.2016, Howard and Ruder2018] for its wide-ranging real-world applications such as fake news detection [Shu et al.2017], document classification [Yang et al.2016], and spoken language understanding (SLU) [Gupta et al.2019a, Gupta et al.2019b, Zhang et al.2018], a core task of conversational assistants like Amazon Alexa or Google Assistant.
However, there are still insufficient characteristic metrics to describe a collection of texts. Unlike numeric or categorical data, simple descriptive statistics such as word counts and vocabulary size alone can hardly capture the syntactic and semantic properties of a text collection.
In this work, we propose a set of characteristic metrics: diversity, density, and homogeneity to quantitatively summarize a collection of texts, where the unit of text could be a phrase, sentence, or paragraph. A text collection is first mapped into a high-dimensional embedding space. Our characteristic metrics are then computed to measure the dispersion, sparsity, and uniformity of the distribution. Based on the choice of embedding methods, these characteristic metrics can help understand the properties of a text collection from different linguistic perspectives, for example, lexical diversity, syntactic variation, and semantic homogeneity. Our proposed diversity, density, and homogeneity metrics extract hard-to-visualize quantitative insight for a better understanding of and comparison between text collections.
To verify the effectiveness of the proposed characteristic metrics, we first conduct a series of simulation experiments that cover various scenarios in two-dimensional as well as high-dimensional vector spaces. The results show that our proposed quantitative characteristic metrics exhibit several desirable and intuitive properties, such as the robustness and linear sensitivity of the diversity metric with respect to random down-sampling. In addition, we investigate the relationship between the characteristic metrics and the performance of a renowned model, BERT [Devlin et al.2018], on the text classification task using two public benchmark datasets. Our results demonstrate that there are high correlations between text classification model performance and the characteristic metrics, which shows the efficacy of our proposed metrics.
2 Related Work
A building block of characteristic metrics for text collections is the language representation method. A classic way to represent a sentence or a paragraph is the n-gram, with dimension equal to the size of the vocabulary. More advanced methods learn a relatively low-dimensional latent space that represents each word or token as a continuous semantic vector such as word2vec
[Mikolov et al.2013], GloVe [Pennington et al.2014], and fastText [Mikolov et al.2017]. These methods have been widely adopted with consistent performance improvements on many NLP tasks. Also, there has been extensive research on representing a whole sentence as a vector, such as a plain or weighted average of word vectors [Arora et al.2016], skip-thought vectors [Kiros et al.2015], and self-attentive sentence encoders [Lin et al.2017].

More recently, there is a paradigm shift from non-contextualized word embeddings to self-supervised language model (LM) pretraining. Language encoders are pretrained on a large text corpus using an LM-based objective and then reused for other NLP tasks in a transfer learning manner. These methods can produce contextualized word representations, which have proven to be effective for significantly improving many NLP tasks. Among the most popular approaches are ULMFiT
[Howard and Ruder2018], ELMo [Peters et al.2018], OpenAI GPT [Radford et al.2018], and BERT [Devlin et al.2018]. In this work, we adopt BERT, a transformer-based technique for NLP pretraining, as the backbone to embed a sentence or a paragraph into a representation vector.

Another stream of related work is the evaluation metrics for cluster analysis. As measuring the property or quality of outputs from a clustering algorithm is difficult, human judgment with cluster visualization tools [Kwon et al.2017, Kessler2017] is often used. There are unsupervised metrics to measure the quality of a clustering result, such as the Calinski-Harabasz score [Caliński and Harabasz1974], the Davies-Bouldin index [Davies and Bouldin1979], and the Silhouette coefficients [Rousseeuw1987]. Complementary to these works that model cross-cluster similarities or relationships, our proposed diversity, density, and homogeneity metrics focus on the characteristics of each single cluster, i.e., intra-cluster rather than inter-cluster relationships.
3 Proposed Characteristic Metrics
We introduce our proposed diversity, density, and homogeneity metrics with their detailed formulations and key intuitions.
Our first assumption is that, for classification, high-quality training data entail that examples of one class are as differentiable and distinct as possible from another class. From a fine-grained, intra-class perspective, a robust text cluster should be diverse in syntax, which is captured by diversity. Each example should also reflect a sufficient signature of the class to which it belongs; that is, each example is representative and contains certain salient features of the class. We define a density metric to account for this aspect. On top of that, examples should be semantically similar and coherent with each other within a cluster, which is where homogeneity comes into play.
The more subtle intuition emerges from the inter-class viewpoint. When there are two or more class labels in a text collection, in an ideal scenario, we would expect the homogeneity to be monotonically decreasing. Potentially, the diversity is increasing with respect to the number of classes, since text clusters should be as distinct and separate as possible from one another. If there is significant ambiguity between classes, the behavior of the proposed metrics and a possible new metric as an inter-class confusability measurement remain for future work.
In practice, the input is a collection of texts {x_1, …, x_m}, where each x_i is a sequence of tokens w_{i1} w_{i2} … w_{il} denoting a phrase, a sentence, or a paragraph. An embedding method E then transforms x_i into a vector E(x_i), and the characteristic metrics are computed with the embedding vectors. For example, with BERT as the embedding method:

(1)  e_i = E(x_i) = BERT(w_{i1} w_{i2} … w_{il})
Note that these embedding vectors often lie in a high-dimensional space, commonly with several hundred dimensions or more. This motivates our design of characteristic metrics to be sensitive to text collections of different properties while being robust to the curse of dimensionality.
We then assume a set of clusters created over the generated embedding vectors. In classification tasks, the embeddings pertaining to members of a class form a cluster, i.e., in a supervised setting. In an unsupervised setting, we may apply a clustering algorithm to the embeddings. It is worth noting that, in general, the metrics are independent of the assumed underlying grouping method.
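In the supervised setting, this grouping amounts to partitioning the embeddings by class label. A minimal sketch (the function name is ours):

```python
import numpy as np
from collections import defaultdict

def group_by_label(embeddings, labels):
    """Partition embedding vectors into per-class clusters.

    embeddings: iterable of vectors; labels: parallel iterable of class ids.
    Returns {label: (n_label, d) array} on which the metrics are computed.
    """
    clusters = defaultdict(list)
    for e, y in zip(embeddings, labels):
        clusters[y].append(e)
    return {y: np.stack(vectors) for y, vectors in clusters.items()}
```

In the unsupervised setting, `labels` would instead come from a clustering algorithm; the metrics themselves are agnostic to how the groups were formed.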
3.1 Diversity
Embedding vectors of a given group of texts, e_1, …, e_m, can be treated as a cluster in the high-dimensional embedding space. We propose a diversity metric to estimate the cluster's dispersion or spread via a generalized notion of the radius.
Specifically, if a cluster is distributed as a multivariate Gaussian with a diagonal covariance matrix Σ = diag(σ_1^2, …, σ_d^2), the shape of an isocontour will be an axis-aligned ellipsoid in R^d. Such isocontours can be described as:

(2)  Σ_{i=1}^{d} (x_i − μ_i)^2 / σ_i^2 = c

where x = (x_1, …, x_d) ranges over all points in R^d on an isocontour, c is a constant, μ = (μ_1, …, μ_d) is the given mean vector with μ_i being the value along the i-th axis, and σ_i^2 is the variance of the i-th axis.
We leverage the geometric interpretation of this formulation and treat the square root of the variance, i.e., the standard deviation σ_i, as the radius of the ellipsoid along the i-th axis. The diversity metric is then defined as the geometric mean of the radii across all axes:

(3)  diversity(e_1, …, e_m) = (Π_{i=1}^{d} σ_i)^{1/d}

where σ_i is the standard deviation, or square root of the variance, along the i-th axis.
In practice, to compute a diversity metric, we first calculate the standard deviation of the embedding vectors along each dimension and take the geometric mean of all calculated values. Note that as the geometric mean acts as a dimensionality normalization, it makes the diversity metric work well in high-dimensional embedding spaces such as BERT's.
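The computation described above can be sketched in a few lines of NumPy (the function name is ours; the log-space geometric mean is a numerical-stability choice, not part of the definition):

```python
import numpy as np

def diversity(embeddings: np.ndarray) -> float:
    """Geometric mean of per-axis standard deviations.

    embeddings: (m, d) array, one row per embedded text.
    """
    stds = embeddings.std(axis=0)  # per-axis radius sigma_i
    # geometric mean computed in log space; small epsilon guards log(0)
    return float(np.exp(np.log(stds + 1e-12).mean()))
```

For example, four points at the corners of a 2x4 rectangle have per-axis standard deviations 1 and 2, giving a diversity of sqrt(2).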
3.2 Density
Another interesting characteristic is the sparsity of the text embedding cluster. The density metric is proposed to estimate the number of samples that fall within a unit of volume in the embedding space.
Following the assumption mentioned above, a straightforward definition of the volume can be written as:

(4)  volume = Π_{i=1}^{d} σ_i

up to a constant factor. However, when the dimension d grows, this formulation easily produces exploding or vanishing density values, i.e., values that go to infinity or zero.
To accommodate the impact of high dimensionality, we impose a dimension normalization. Specifically, we introduce a notion of effective axes, which assumes most variance can be explained or captured in a subspace of dimension √d. We group all the axes in this subspace together and compute the geometric mean of their radii as the effective radius. The dimension-normalized volume is then formulated as:

(5)  volume' = (Π_{i=1}^{d} σ_i)^{√d / d}
Given a set of embedding vectors e_1, …, e_m, we define the density metric as:

(6)  density(e_1, …, e_m) = m / volume'
In practice, the computed density metric values often follow a heavy-tailed distribution; thus, the density is sometimes reported and denoted on a log scale.
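A minimal sketch of this computation, assuming the √d effective-subspace normalization of Eq. (5) (the function name and the log-scale default are ours):

```python
import numpy as np

def density(embeddings: np.ndarray, log_scale: bool = True) -> float:
    """Samples per unit of dimension-normalized volume.

    embeddings: (m, d) array. With log_scale=True the log density is
    returned, matching the log-scale reporting described above.
    """
    m, d = embeddings.shape
    stds = embeddings.std(axis=0)
    # effective radius: geometric mean of per-axis radii
    r_eff = np.exp(np.log(stds + 1e-12).mean())
    volume = r_eff ** np.sqrt(d)  # dimension-normalized volume
    value = m / volume
    return float(np.log(value)) if log_scale else float(value)
```

For the 2-D rectangle example above (m = 4, per-axis stds 1 and 2), the effective radius is sqrt(2) and the raw density is 4 / sqrt(2)^sqrt(2).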
3.3 Homogeneity
The homogeneity metric is proposed to summarize the uniformity of a cluster distribution, that is, how uniformly the embedding vectors of the samples in a group of texts are distributed in the embedding space. We propose to quantitatively describe homogeneity by building a fully-connected, edge-weighted network, which can be modeled by a Markov chain. The Markov chain's entropy rate is calculated and normalized into the [0, 1] range by dividing by the entropy's theoretical upper bound. This output value is defined as the homogeneity metric, detailed as follows.

To construct a fully-connected network from the embedding vectors e_1, …, e_m, we compute their pairwise distances as edge weights, an idea similar to AttriRank [Hsu et al.2017] (https://github.com/ntumslab/AttriRank/blob/master/attrirank.pdf). As the Euclidean distance is not a good metric in high dimensions, we normalize the distance by raising it to a power p. We then define a Markov chain model with the weight of the edge between nodes i and j being:

(7)  w_ij = ||e_i − e_j||_2^p

and the corresponding transition probability:

(8)  p_ij = w_ij / Σ_{k≠i} w_ik
All the transition probabilities p_ij form the transition matrix of a Markov chain, whose entropy rate (see https://en.wikipedia.org/wiki/Entropy_rate) can be calculated as:

(9)  entropy = −Σ_{i=1}^{m} μ_i Σ_{j=1}^{m} p_ij log p_ij
where μ is the stationary distribution of the Markov chain. As the self-transition probability is always zero because of the zero self-distance, there are m − 1 possible destinations from each node, and the entropy's theoretical upper bound becomes:

(10)  entropy_max = log(m − 1)
Our proposed homogeneity metric is then normalized into [0, 1] as a uniformity measure:

(11)  homogeneity(e_1, …, e_m) = entropy / log(m − 1)
The intuition is that if some samples are close to each other but far from all the others, the calculated entropy decreases to reflect the unbalanced distribution. In contrast, if each sample can reach the other samples within more or less the same distances, the calculated entropy as well as the homogeneity measure would be high, as it implies the samples could be more uniformly distributed.
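The whole procedure can be sketched as follows (the function name is ours, and the distance-normalization exponent is left as a free parameter):

```python
import numpy as np

def homogeneity(embeddings: np.ndarray, power: float = 1.0) -> float:
    """Normalized entropy rate of a Markov chain over pairwise distances.

    embeddings: (m, d) array; `power` is the distance-normalization
    exponent p from Eq. (7), kept as a parameter here.
    """
    m = embeddings.shape[0]
    # pairwise Euclidean distances raised to `power` -> edge weights
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    w = np.linalg.norm(diff, axis=-1) ** power
    np.fill_diagonal(w, 0.0)  # self-transitions have zero weight
    p = w / w.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    # stationary distribution: left eigenvector of P with eigenvalue 1
    vals, vecs = np.linalg.eig(p.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    mu = mu / mu.sum()
    # entropy rate, normalized by its upper bound log(m - 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -(mu * plogp.sum(axis=1)).sum()
    return float(entropy / np.log(m - 1))
```

As a sanity check, three points at the corners of an equilateral triangle are perfectly uniform: every transition probability is 1/2, so the entropy rate hits its upper bound log(2) and the homogeneity is 1.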
4 Simulations
To verify that each proposed characteristic metric holds its desirable and intuitive properties, we conduct a series of simulation experiments in two-dimensional as well as high-dimensional spaces. The latter has the same dimensionality as the output of our chosen embedding method, BERT, in the following Experiments section.
4.1 Simulation Setup
The base simulation setup is a randomly generated isotropic Gaussian blob centered around the origin, with a fixed number of data points and a fixed standard deviation along each axis. All Gaussian blobs are created using the make_blobs function in the scikit-learn package (https://scikit-learn.org/stable).
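The base setup can be reproduced directly with scikit-learn; the specific counts below (1,000 points, unit standard deviation) are illustrative assumptions rather than the exact simulation values:

```python
import numpy as np
from sklearn.datasets import make_blobs

# One isotropic Gaussian blob centered at the origin.
X, _ = make_blobs(
    n_samples=1000,            # number of data points (assumed value)
    n_features=2,              # 2-D case; a BERT-sized space would use 768
    centers=np.zeros((1, 2)),  # single blob centered at the origin
    cluster_std=1.0,           # isotropic: same std along each axis (assumed)
    random_state=0,
)
```

The down-sampling, varying-spread, outlier, and sub-cluster scenarios are then variations on this blob (dropping rows, scaling `cluster_std`, appending points on a fixed-radius sphere, or passing multiple `centers`).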
Four simulation scenarios are used to investigate the behavior of our proposed quantitative characteristic metrics:

Down-sampling: Down-sample the base cluster to a fraction of its original size, i.e., create Gaussian blobs with fewer data points;

Varying Spread: Generate Gaussian blobs with increasing standard deviations along each axis;

Outliers: Add outlier data points, amounting to a fraction of the original cluster size, randomly on a sphere surface with a fixed norm or radius;

Multiple Sub-clusters: Along one axis, keeping the total number of data points fixed, create an increasing number of sub-clusters with equal sample sizes placed at increasing distances.
For each scenario, we simulate a cluster and compute the characteristic metrics in both two-dimensional and high-dimensional spaces. Figure 1 visualizes each scenario with t-distributed Stochastic Neighbor Embedding (t-SNE) [Maaten and Hinton2008]. The high-dimensional simulations are visualized by down-projecting via Principal Component Analysis (PCA) followed by t-SNE.
4.2 Simulation Results
Figure 2 summarizes calculated diversity metrics in the first row, density metrics in the second row, and homogeneity metrics in the third row, for all simulation scenarios.
The diversity metric is robust, as its values remain almost the same under down-sampling of an input cluster. This implies the diversity metric has the desirable property of being insensitive to the size of the input. On the other hand, it shows a linear relationship to varying spreads: another intuitive property of a diversity metric is that it grows linearly with increasing dispersion or variance of the input data. With more outliers or more sub-clusters, the diversity metric can also reflect the increasing dispersion of the cluster distribution but is less sensitive in high-dimensional spaces.
The density metric exhibits a linear relationship to the size of the input when down-sampling, which is desired. When increasing spreads, the trend of the density metric corresponds well with human intuition. Note that the density metric decreases at a much faster rate in the higher-dimensional space, as a log scale is used in the figure. The density metric also drops when adding outliers or having multiple distant sub-clusters. This makes sense, since both scenarios should increase the dispersion of the data and thus increase our notion of volume as well. In the multiple sub-clusters scenario, the density metric becomes less sensitive in the higher-dimensional space. The reason could be that the sub-clusters are distributed only along one axis and thus have a smaller impact on volume in higher-dimensional spaces.
As random down-sampling or increasing the variance of each axis should not affect the uniformity of a cluster distribution, we expect the homogeneity metric to remain approximately the same, and the proposed homogeneity metric indeed demonstrates these ideal properties. Interestingly, for outliers, we first see huge drops of the homogeneity metric, but the values go up again slowly as more outliers are added. This corresponds well with our intuition that a small number of outliers breaks the uniformity, but more outliers should mean an increase of uniformity, because the distribution of the added outliers themselves has high uniformity.
For multiple sub-clusters, as more sub-clusters are present, the homogeneity should and does decrease, as the data are less and less uniformly distributed in the space.
To sum up, from all simulations, our proposed diversity, density, and homogeneity metrics indeed capture the essence or intuition of dispersion, sparsity, and uniformity in a cluster distribution.
5 Experiments
The two real-world text classification tasks we use for experiments are sentiment analysis and Spoken Language Understanding (SLU).
5.1 Chosen Embedding Method
BERT is a self-supervised language model pretraining approach based on the Transformer [Vaswani et al.2017], a multi-headed self-attention architecture that can produce different representation vectors for the same token in various sequences, i.e., contextual embeddings.
When pretraining, BERT concatenates two sequences as input, with special tokens denoting the start, separation, and end, respectively. BERT is then pretrained on a large unlabeled corpus with the masked language model (MLM) objective, which randomly masks out tokens that the model learns to predict. The other pretraining task is next sentence prediction (NSP): predicting whether two sequences follow each other in the original text or not.
In this work, we use the pretrained BERT-base model, which has 12 layers (L), 12 self-attention heads (A), and hidden dimension 768 (H), as the language embedding to compute the proposed data metrics. The off-the-shelf pretrained BERT is obtained from GluonNLP (https://gluonnlp.mxnet.io/model_zoo/bert/index.html). For each token sequence x = (w_1, …, w_l) with length l, BERT takes the sequence as input and generates token-level embeddings (h_1, …, h_l). To obtain the sequence representation, we use a mean pooling over the token embeddings:

(12)  e = (1/l) Σ_{t=1}^{l} h_t

where e ∈ R^H. A text collection {x_1, …, x_m}, i.e., a set of token sequences, is then transformed into a group of H-dimensional vectors {e_1, …, e_m}.
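The mean pooling described above is a one-line reduction once the token-level embeddings are available; a minimal sketch (the function name is ours, and the array stands in for an actual BERT layer's output):

```python
import numpy as np

def sequence_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool token-level embeddings of shape (l, H) into a single
    H-dimensional sequence vector."""
    return token_embeddings.mean(axis=0)
```

In practice `token_embeddings` would be the (l, 768) output of a chosen BERT layer for one sequence; applying this over a text collection yields the group of H-dimensional vectors on which the metrics are computed.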
We compute each metric as described previously, using three BERT layers, L1, L6, and L12, as the embedding space, respectively. The calculated metric values are averaged over the layers for each class, and then averaged over classes, weighted by class size, to obtain the final value for a dataset.
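The two-stage averaging just described can be sketched as follows (the function name and input layout are ours):

```python
import numpy as np

def dataset_metric(per_class_layer_values: dict, class_sizes: dict) -> float:
    """Aggregate a metric into one value per dataset.

    per_class_layer_values: {class: [value at L1, value at L6, value at L12]}
    class_sizes: {class: number of examples}
    First average over layers within each class, then over classes,
    weighted by class size.
    """
    per_class = {c: float(np.mean(v)) for c, v in per_class_layer_values.items()}
    total = sum(class_sizes.values())
    return sum(per_class[c] * class_sizes[c] for c in per_class) / total
```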
Table 1: Model performance and characteristic metrics under down-sampling on SST-2.

Down-sampling to | Training Set Size | Accuracy | Diversity | Density | Homogeneity
100% | 67,350 | 0.9266 | 0.292 | 44.487 | 0.928
90% | 60,615 | 0.9323 | 0.292 | 44.367 | 0.927
80% | 53,880 | 0.9260 | 0.292 | 44.224 | 0.927
70% | 47,146 | 0.9266 | 0.292 | 44.071 | 0.925
60% | 40,411 | 0.9312 | 0.292 | 43.928 | 0.924
50% | 33,676 | 0.9300 | 0.292 | 43.672 | 0.922
40% | 26,941 | 0.9243 | 0.292 | 43.384 | 0.919
30% | 20,206 | 0.9300 | 0.292 | 43.148 | 0.917
20% | 13,471 | 0.9174 | 0.293 | 42.733 | 0.914
10% | 6,736 | 0.9071 | 0.294 | 41.972 | 0.908
Table 2: Model performance and characteristic metrics under down-sampling on Snips.

Down-sampling to | Training Set Size | IC Accuracy (%) | SL F1 (%) | Diversity | Density | Homogeneity
100% | 13,084 | 98.71 | 96.06 | 0.215 | 48.291 | 0.950
90% | 11,773 | 98.57 | 95.79 | 0.215 | 48.199 | 0.949
80% | 10,465 | 99.00 | 95.55 | 0.215 | 48.109 | 0.949
70% | 9,157 | 99.14 | 95.13 | 0.215 | 47.996 | 0.948
60% | 7,848 | 98.71 | 95.02 | 0.215 | 47.751 | 0.948
50% | 6,541 | 98.86 | 94.38 | 0.215 | 47.660 | 0.945
40% | 5,231 | 99.00 | 94.74 | 0.214 | 47.449 | 0.944
30% | 3,922 | 98.57 | 93.74 | 0.215 | 47.090 | 0.941
20% | 2,614 | 96.42 | 92.63 | 0.214 | 46.877 | 0.939
10% | 1,306 | 87.20 | 89.12 | 0.214 | 46.158 | 0.929
5.2 Experimental Setup
In the first task, we use the SST-2 (Stanford Sentiment Treebank, version 2) dataset [Socher et al.2013] to conduct sentiment analysis experiments. SST-2 is a binary sentence classification dataset with train/dev/test splits provided and two types of sentence labels, i.e., positive and negative.
The second task involves two essential problems in SLU: intent classification (IC) and slot labeling (SL). In IC, the model needs to detect the intention conveyed by a text input (i.e., an utterance). For example, for the input "I want to book a flight to Seattle", the intention is to book a flight ticket, hence the intent class is bookFlight. In SL, the model needs to extract the semantic entities that are related to the intent. In the same example, Seattle is a slot value related to booking the flight, i.e., the destination. Here we experiment with the Snips dataset [Coucke et al.2018], which is widely used in SLU research. This dataset contains spoken utterances (in text form), each classified into one of 7 intents.
In both tasks, we used the open-sourced GluonNLP BERT model to perform text classification. For evaluation, sentiment analysis is measured in accuracy, whereas IC and SL are measured in accuracy and F1 score, respectively. BERT is fine-tuned on the train/dev sets and evaluated on the test sets.
We down-sampled the SST-2 and Snips training sets from 100% to 10% of their original sizes, with 10% intervals. BERT's performance is reported for each down-sampled setting in Table 1 and Table 2. We used the entire test sets for all model evaluations.
To compare, we compute the proposed data metrics, i.e., diversity, density, and homogeneity, on the original and the down-sampled training sets.
5.3 Experimental Results
We discuss the three proposed characteristic metrics, i.e., diversity, density, and homogeneity, and the model performance scores from the down-sampling experiments on the two public benchmark datasets in the following subsections.
5.3.1 SST-2
In Table 1, the sentiment classification accuracy is 92.66% without down-sampling, which is consistent with the reported GluonNLP BERT model performance on SST-2. It also indicates that the SST-2 training data are differentiable between label classes, i.e., between the positive class and the negative class, which satisfies our assumption for the characteristic metrics.
Decreasing the training set size does not reduce performance until it is randomly down-sampled to 20% or less of the original size. Meanwhile, the density and homogeneity metrics also decrease significantly (highlighted in bold in Table 1), implying a clear relationship between these metrics and model performance.
5.3.2 Snips
In Table 2, the Snips dataset seems to be distinct between IC/SL classes, since the IC accuracy and SL F1 without down-sampling are as high as 98.71% and 96.06%, respectively. Similar to SST-2, this implies that the Snips training data should also support the inter-class differentiability assumption for our proposed characteristic metrics.
IC accuracy on Snips remains above 98% until we down-sample the training set to 20% of the original size. In contrast, the SL F1 score is more sensitive to down-sampling of the training set, as it starts decreasing almost immediately. When only 10% of the training set is left, the SL F1 score drops to 89.12%.
The diversity metric barely decreases until the training set is reduced to 40% or less of the original set. This implies that random sampling has little impact on diversity at moderate sampling rates: the training set very likely contains redundant information in terms of text diversity. This is supported by the consistently high IC/SL performance we observe across the 100%–30% down-sampling ratios.
Moreover, the biggest drops of density and homogeneity (highlighted in bold in Table 2) highly correlate with the biggest IC/SL drops, at the point where the training set size is reduced from 20% to 10%. This suggests that our proposed metrics can be used as good indicators of model performance and for characterizing text datasets.
6 Analysis
We calculate and show in Table 3 the Pearson's correlations between the three proposed characteristic metrics, i.e., diversity, density, and homogeneity, and the model performance scores from the down-sampling experiments in Table 1 and Table 2. Strong correlations are highlighted in bold. As mentioned before, model performance is highly correlated with density and homogeneity, both computed on the training set. Diversity is only correlated with the Snips SL F1 score, at a moderate level.
Table 3: Pearson's correlations between the proposed characteristic metrics and model performance.

Dataset | SST-2 | Snips | Snips
Task Evaluation Metric | Acc. | IC Acc. | SL F1
Corr. to Diversity | 0.196 | 0.196 | 0.555
Corr. to Density | 0.637 | 0.637 | 0.716
Corr. to Homogeneity | 0.716 | 0.958 | 0.983
These results are consistent with our simulation results, which show that random sampling of a dataset does not necessarily affect the diversity but can reduce the density and, marginally, the homogeneity, due to the decreasing number of data points in the embedding space. However, the simultaneous large drops of model performance, density, and homogeneity imply that only limited redundancy remains at that point and that more informative data points are being thrown away by further down-sampling. Moreover, the results also suggest that model performance on text classification tasks corresponds not only with data diversity but with training data density and homogeneity as well.
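The correlation analysis can be recomputed from the tables; for example, SST-2 accuracy against homogeneity from Table 1 (the table values are rounded, so the recomputed coefficient is only approximate):

```python
import numpy as np

# Accuracy and homogeneity columns from the SST-2 down-sampling runs
# (Table 1), from 100% down to 10% of the training set.
accuracy = [0.9266, 0.9323, 0.9260, 0.9266, 0.9312,
            0.9300, 0.9243, 0.9300, 0.9174, 0.9071]
homogeneity = [0.928, 0.927, 0.927, 0.925, 0.924,
               0.922, 0.919, 0.917, 0.914, 0.908]

r = np.corrcoef(accuracy, homogeneity)[0, 1]  # Pearson's r
```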
7 Conclusions
In this work, we proposed several characteristic metrics to describe the diversity, density, and homogeneity of text collections without using any labels. Pretrained language embeddings are used to efficiently characterize text datasets. Simulations and experiments showed that our intrinsic metrics are robust and highly correlated with model performance on different text classification tasks. As future work, we would like to apply the diversity, density, and homogeneity metrics to text data augmentation and selection in a semi-supervised manner.
8 Bibliographical References
References
 [Arora et al.2016] Arora, S., Liang, Y., and Ma, T. (2016). A simple but toughtobeat baseline for sentence embeddings.
 [Caliński and Harabasz1974] Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods, 3(1):1–27.
 [Coucke et al.2018] Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D., Doumouro, C., Gisselbrecht, T., Caltagirone, F., Lavril, T., et al. (2018). Snips voice platform: an embedded spoken language understanding system for privatebydesign voice interfaces. arXiv preprint arXiv:1805.10190.
 [Davies and Bouldin1979] Davies, D. L. and Bouldin, D. W. (1979). A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224–227.
 [Devlin et al.2018] Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 [Gupta et al.2019a] Gupta, A., Hewitt, J., and Kirchhoff, K. (2019a). Simple, fast, accurate intent classification and slot labeling for goaloriented dialogue systems. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 46–55.
 [Gupta et al.2019b] Gupta, A., Zhang, P., Lalwani, G., and Diab, M. (2019b). Casanlu: Contextaware selfattentive natural language understanding for taskoriented chatbots. arXiv preprint arXiv:1909.08705.
 [Howard and Ruder2018] Howard, J. and Ruder, S. (2018). Universal language model finetuning for text classification. arXiv preprint arXiv:1801.06146.
 [Hsu et al.2017] Hsu, C.C., Lai, Y.A., Chen, W.H., Feng, M.H., and Lin, S.D. (2017). Unsupervised ranking using graph structures and node attributes. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 771–779. ACM.
 [Joulin et al.2016] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
 [Kessler2017] Kessler, J. S. (2017). Scattertext: a browserbased tool for visualizing how corpora differ. arXiv preprint arXiv:1703.00565.
 [Kiros et al.2015] Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skipthought vectors. In Advances in neural information processing systems, pages 3294–3302.
 [Kwon et al.2017] Kwon, B. C., Eysenbach, B., Verma, J., Ng, K., De Filippi, C., Stewart, W. F., and Perer, A. (2017). Clustervision: Visual supervision of unsupervised clustering. IEEE transactions on visualization and computer graphics, 24(1):142–151.
 [Lin et al.2017] Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., and Bengio, Y. (2017). A structured selfattentive sentence embedding. arXiv preprint arXiv:1703.03130.

 [Maaten and Hinton2008] Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605.
 [Mikolov et al.2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
 [Mikolov et al.2017] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2017). Advances in pretraining distributed word representations. arXiv preprint arXiv:1712.09405.
 [Pennington et al.2014] Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
 [Peters et al.2018] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
 [Radford et al.2018] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
 [Rousseeuw1987] Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.
 [Shu et al.2017] Shu, K., Sliva, A., Wang, S., Tang, J., and Liu, H. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36.
 [Socher et al.2013] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., and Potts, C. (2013). Parsing With Compositional Vector Grammars. In EMNLP.
 [Vaswani et al.2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 [Yang et al.2016] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489.
 [Zhang et al.2015] Zhang, X., Zhao, J., and LeCun, Y. (2015). Characterlevel convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
 [Zhang et al.2018] Zhang, C., Li, Y., Du, N., Fan, W., and Yu, P. S. (2018). Joint slot filling and intent detection via capsule neural networks. arXiv preprint arXiv:1812.09471.