1 Introduction
Geographic information related to Twitter enriches the availability of data resources. Such information is indispensable for various practical applications such as early earthquake detection Sakaki et al. (2010), infectious disease dispersion assessment Broniatowski et al. (2013), and regional user behavior assessment during an election period Caldarelli et al. (2014). However, the performance of these applications depends strongly on the number of geotagged tweets, which account for fewer than 0.5% of all tweets Cheng et al. (2010).
To extend the possibilities of the geographic information, a great deal of effort has been devoted to specifying the geolocation automatically Han et al. (2014). These studies are classified roughly into two groups: user-level prediction Cheng et al. (2010); Han et al. (2014); Jurgens et al. (2015); Rahimi et al. (2015) and message-level prediction Dredze et al. (2016); Liu and Huang (2016); Priedhorsky et al. (2014). The former predicts the user's residential area. The latter predicts the place that the user mentioned. This study targets the latter problem, message-level prediction, which involves the following three difficulties.
First, most tweets lack the information needed to identify the true geolocation. In general, many tweets do not include geolocation-identifiable words. Therefore, it is difficult even for humans to identify a geolocation (Figure 0(b)).
Next, some location names are ambiguous because one word refers to multiple locations. For example, places called “Portland” exist in several locations worldwide. This ambiguity also arises within a single country, as shown in Figure 2. Although additional context words are necessary to identify the exact location, many tweets do not include such clue words. For such tweets, real-valued point estimation is expected to be degraded by regression towards the mean Stigler (1997).
Finally, even if a user states a word that represents an exact location, the user is not necessarily there. When the user describes several places in the tweet, contextual comprehension of the tweet is needed to identify the true location (Figure 0(c)).
In contrast to most studies, this study was conducted to resolve these issues based on the density estimation approach. A salient benefit of density estimation is that it exposes the uncertainty related to the tweet user location, because the uncertainty propagates from the estimated distribution, and that it properly handles tweets distributed over multiple points. Figure 1 shows each estimated density as a heatmap. If the tweet includes plenty of clues, the estimated distribution is concentrated near the true location (Figure 0(a)); otherwise, it is spread out (Figure 0(b)). Furthermore, the density-based approach can represent multiple output candidates, whereas the regression-based approach cannot (Figure 0(c)).
The density-based approach provides additional benefits for practical application. The estimated density attaches an estimation reliability to each tweet in the form of the likelihood value. For a reliable estimation, the estimated density provides a high likelihood (Figure 0(a) and 0(c)); for an unreliable one, it provides a low likelihood (Figure 0(b)).
To realize this modeling, we propose a Convolutional Mixture Density Network (CMDN), a method for estimating the geolocation density from text data. CMDN extracts valuable features using a convolutional neural network architecture and converts these features to mixture density parameters. Our experimental results reveal not only high prediction performance but also that the reliability measure works properly for filtering out uncertain estimations.
2 Related work
Social media geolocation has been undertaken on various platforms such as Facebook Backstrom et al. (2010), Flickr Serdyukov et al. (2009), and Wikipedia Lieberman and Lin (2009). In particular, Twitter geolocation is the predominant field among them because of its availability Han et al. (2014).
Twitter geolocation methods are not confined to text information. Diverse information can improve geolocation performance, such as meta information (time Dredze et al. (2016) and estimated age and gender Pavalanathan and Eisenstein (2015)), social network structures Jurgens et al. (2015); Rahimi et al. (2015), and user movement Liu and Huang (2016). Many studies, however, have attempted user-level geolocation, not message-level geolocation. Although user-level geolocation is certainly effective for some applications, message-level geolocation supports fine-grained analyses.
Priedhorsky et al. (2014), however, attempted density estimation for message-level geolocation by estimating a Gaussian Mixture Model independently for each word and then combining these models to derive the density of each tweet. Although that paper proposed many weight estimation methods, many of them depend strongly on locally distributed words. Our proposed method, CMDN, estimates the geolocation density from the text sequence in an end-to-end manner and considers the marginal context of the tweet via a CNN.
3 Convolutional Neural Network for Regression (Unimodal output)
We start by introducing the convolutional neural network Fukushima (1980); LeCun et al. (1998) (CNN) for a regression problem, which directly estimates the real-valued output, as our baseline method. Our regression formulation is almost identical to that of the CNN for document classification by Kim (2014). We merely remove the softmax layer and replace the loss function.
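We assume that a tweet has length $L$ (padded where necessary) and that $w_i$ represents the index of the $i$-th word. We project the words $w_i$ to vectors $x_i$ through an embedding matrix $W_e$, $x_i = W_e[w_i] \in \mathbb{R}^{d}$, where $d$ represents the size of the embeddings. We compose the sentence vector $x_{1:L}$ by concatenating the word vectors $x_i$, as

$$x_{1:L} = x_1 \oplus x_2 \oplus \dots \oplus x_L.$$

In that equation, $\oplus$ represents vector concatenation.

To extract valuable features from sentences, we apply a filter matrix $W_f$ to every part of the sentence vector with window size $s$ as

$$c_i = f(W_f \, x_{i:i+s-1} + b_f) \in \mathbb{R}^{m}, \quad i = 1, \dots, L - s + 1,$$

where $m$ stands for the number of feature maps, and $f$ represents the activation function. As described herein, we employ ReLU Nair and Hinton (2010) as the activation function.

For each filter’s output, we apply 1-max pooling to extract the most probable window feature $\hat{c}_j = \max_i c_{i,j}$. Then we compose the abstracted feature vector $h$ by concatenating each pooled feature $\hat{c}_j$ as shown below.

$$h = [\hat{c}_1, \hat{c}_2, \dots, \hat{c}_m]$$

Finally, we estimate the real-valued output $\hat{y}$ using the abstracted feature vector $h$ as

$$\hat{y} = W_r h + b_r,$$

where $p$ signifies the output dimension, $W_r \in \mathbb{R}^{p \times m}$ denotes the regression weight matrix, and $b_r \in \mathbb{R}^{p}$ represents the bias vector for regression.

To optimize the regression model, we use two loss functions between the true value $y_n$ and the estimated value $\hat{y}_n$. One is the $\ell_2$ loss,

$$\mathcal{L}_{\ell_2} = \sum_{n=1}^{N} \lVert y_n - \hat{y}_n \rVert_2^2,$$

and the other is a loss function that is robust to outliers, the $\ell_1$ loss,

$$\mathcal{L}_{\ell_1} = \sum_{n=1}^{N} \lVert y_n - \hat{y}_n \rVert_1,$$

where $N$ represents the sample size. The $\ell_1$ loss shrinks the effect of outliers on the estimation more than the $\ell_2$ loss does.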
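The forward pass described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `cnn_regress`, the argument names, and the toy weights are all hypothetical, and training is omitted entirely.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def cnn_regress(word_ids, emb, filters, filter_bias, w_out, b_out, window):
    """Embed -> convolve over word windows -> ReLU -> 1-max pool -> linear output."""
    x = [emb[w] for w in word_ids]                 # L word vectors, each of size d
    h = []                                         # pooled features, one entry per filter
    for w_f, b_f in zip(filters, filter_bias):
        acts = []
        for i in range(len(x) - window + 1):
            # Concatenate the word vectors inside the current window.
            flat = [v for vec in x[i:i + window] for v in vec]
            acts.append(relu(sum(a * b for a, b in zip(w_f, flat)) + b_f))
        h.append(max(acts))                        # 1-max pooling over window positions
    # Linear regression layer on top of the pooled features.
    return [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(w_out, b_out)]

# Toy setup (all values hypothetical): 3-word vocabulary, d = 1, one filter, window = 2.
emb = [[1.0], [2.0], [-3.0]]
y_hat = cnn_regress([0, 1, 2], emb, filters=[[1.0, 1.0]], filter_bias=[0.0],
                    w_out=[[2.0], [-1.0]], b_out=[0.5, 0.0], window=2)
print(y_hat)  # [6.5, -3.0]
```

In the toy call, the two windows produce activations 3.0 and 0.0, so the pooled feature is 3.0 and the two-dimensional output follows from the linear layer.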
4 Convolutional Mixture Density Network (Multimodal output)
In the previous section, we introduced the CNN method for the regression problem. This regression formulation is suitable for well-defined problems in which the input and output have a one-to-one correspondence. Our tweet corpus fundamentally violates this assumption. Therefore, a more flexible model must be used to represent the richer information.
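A small numeric illustration may help show why the one-to-one assumption matters. The sketch below uses hypothetical one-dimensional "locations" for tweets that share identical text but refer to two different places. It relies on two standard facts: the constant that minimizes a sum of squared errors is the mean, and a constant that minimizes a sum of absolute errors is the median.

```python
from statistics import mean, median

# Hypothetical targets (km along a line) for tweets with the same text,
# e.g. two towns that share one name.
targets = [0.0, 0.0, 0.0, 100.0, 100.0]

def l2_loss(y_hat):
    return sum((y - y_hat) ** 2 for y in targets)

def l1_loss(y_hat):
    return sum(abs(y - y_hat) for y in targets)

l2_opt = mean(targets)     # best single point under the l2 loss
l1_opt = median(targets)   # best single point under the l1 loss

print(l2_opt, l1_opt)  # 40.0 0.0
```

Under the $\ell_2$ loss the best single prediction falls at 40 km, where no tweet was posted, which is the regression-towards-the-mean effect. The $\ell_1$ prediction snaps to the majority location but cannot express that a second candidate exists, and it becomes unstable when the candidates are nearly balanced.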
In this section, we propose a novel architecture for text-to-density estimation, the Convolutional Mixture Density Network (CMDN), which is an extension of the Mixture Density Network by Bishop (1994). In contrast to the regression approach, which directly represents the output values $y$, CMDN can accommodate more complex information as a probability distribution. The CMDN estimates the parameters of a Gaussian mixture model $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, where $k$ indexes the $k$-th mixture component, using the same abstracted features $h$ as in the CNN regression (Sec 3).
4.1 Parameter estimation by neural network
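Presuming that the density consists of $K$ components of the multivariate normal distribution $\mathcal{N}(y \mid \mu_k, \Sigma_k)$, each $p$-dimensional normal distribution has $p(p+3)/2$ parameters: $2p$ parameters for the mean $\mu_k$ and the diagonal values of the covariance matrix $\Sigma_k$, and $p(p-1)/2$ correlation parameters $\rho$ between the output dimensions. For $p = 2$, the parameters of each component are represented as

$$\mu_k = (\mu_{k,1}, \mu_{k,2})^{\top}, \qquad \Sigma_k = \begin{pmatrix} \sigma_{k,1}^2 & \rho_k \sigma_{k,1} \sigma_{k,2} \\ \rho_k \sigma_{k,1} \sigma_{k,2} & \sigma_{k,2}^2 \end{pmatrix}.$$

To estimate these parameters, we first project the hidden layer $h$ into a parameter space of the required size as

$$\theta = W_p h + b_p,$$

where $W_p$ and $b_p$ respectively denote the weight matrix and bias vector.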
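The parameter count above can be checked with a short sketch. The helper names are hypothetical, and the assumption that the projection also emits one raw mixture weight per component is ours; the number of components used in the example is an arbitrary illustration, not a value taken from the paper.

```python
def gaussian_param_count(p):
    """Parameters of one p-dimensional Gaussian: p means, p variances, p(p-1)/2 correlations."""
    return p + p + p * (p - 1) // 2

def mdn_output_size(num_components, p):
    """Raw outputs needed: one mixture weight plus the Gaussian parameters, per component."""
    return num_components * (1 + gaussian_param_count(p))

print(gaussian_param_count(2))   # 5
print(mdn_output_size(10, 2))    # 60
```

For two-dimensional outputs such as latitude and longitude, each component therefore needs five Gaussian parameters, in agreement with $p(p+3)/2$ for $p = 2$.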
Although $\theta$ has a sufficient number of parameters to represent the mixture model, these raw values cannot be used directly as the parameters of the multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ and its mixture weight $\pi$. The mixture weights $\pi_k$ must be positive and must sum to 1. The variance parameters $\sigma$ must be positive real values. The correlation parameter $\rho$ must lie in $(-1, 1)$. For this purpose, we transform the real-valued outputs $\theta$ into the valid range for each parameter of the mixture density.
4.2 Parameter conversion
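For simplicity, we first decompose $\theta$ into the raw values for each mixture parameter as

$$\theta = (\theta_{\pi}, \theta_{\mu}, \theta_{\sigma}, \theta_{\rho}).$$

To restrict each parameter range, we convert each vanilla parameter as

$$\pi = \mathrm{softmax}(\theta_{\pi}), \qquad \mu = \theta_{\mu}, \qquad \sigma = \mathrm{softplus}(\theta_{\sigma}), \qquad \rho = \mathrm{softsign}(\theta_{\rho}).$$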
The original MDN paper Bishop (1994) and its well-known application to handwriting generation Graves (2013) used an exponential function for transforming the variance parameter $\sigma$ and a hyperbolic tangent function for the correlation parameter $\rho$. However, we use softplus Glorot et al. (2011) for the variance and softsign Glorot and Bengio (2010) for the correlation.
With the exponential and hyperbolic tangent functions, the gradients often explode or vanish. Replacing these activation functions in the output layer prevents such gradient problems. Our proposed transformation is effective for achieving rapid convergence and stable learning.
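A minimal sketch of this conversion for two-dimensional outputs follows. The flat layout of the raw vector (weights first, then means, scales, and correlations) and the function name `convert` are assumptions made for illustration; the paper does not specify an ordering.

```python
import math

def softmax(zs):
    m = max(zs)                                   # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def softplus(z):
    # log(1 + exp(z)), arranged so that exp() never overflows
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softsign(z):
    return z / (1.0 + abs(z))                     # maps the real line into (-1, 1)

def convert(theta, num_components):
    """Assumed layout: K raw weights, 2K means, 2K raw scales, K raw correlations."""
    k_total = num_components
    pi = softmax(theta[:k_total])
    mu = [(theta[k_total + 2 * k], theta[k_total + 2 * k + 1]) for k in range(k_total)]
    sigma = [(softplus(theta[3 * k_total + 2 * k]), softplus(theta[3 * k_total + 2 * k + 1]))
             for k in range(k_total)]
    rho = [softsign(z) for z in theta[5 * k_total:6 * k_total]]
    return pi, mu, sigma, rho

# Hypothetical raw network output for K = 2 components.
theta = [0.0, 1.0, 35.0, 139.0, 34.0, 135.0, -2.0, 0.5, 3.0, -50.0, 10.0, -10.0]
pi, mu, sigma, rho = convert(theta, 2)
```

Softplus grows only linearly for large inputs and softsign saturates polynomially rather than exponentially, which is one plausible reading of why these choices give better-behaved gradients than the exponential and hyperbolic tangent functions.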
4.3 Loss function for parameter estimation
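To optimize the mixture density model, we use the negative log-likelihood as the training loss:

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k(x_n) \, \mathcal{N}\big(y_n \mid \mu_k(x_n), \Sigma_k(x_n)\big).$$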
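For two-dimensional outputs, this loss can be written out directly. The sketch below assumes the per-tweet mixture parameters have already been converted to valid ranges; the function names and the tuple layout of `params` are hypothetical.

```python
import math

def bivariate_normal_pdf(y, mu, sigma, rho):
    """Density of a 2-D Gaussian with scales sigma = (s1, s2) and correlation rho."""
    dx = (y[0] - mu[0]) / sigma[0]
    dy = (y[1] - mu[1]) / sigma[1]
    one_minus_r2 = 1.0 - rho * rho
    z = dx * dx - 2.0 * rho * dx * dy + dy * dy
    norm = 2.0 * math.pi * sigma[0] * sigma[1] * math.sqrt(one_minus_r2)
    return math.exp(-z / (2.0 * one_minus_r2)) / norm

def mixture_density(y, pi, mu, sigma, rho):
    return sum(w * bivariate_normal_pdf(y, m, s, r)
               for w, m, s, r in zip(pi, mu, sigma, rho))

def negative_log_likelihood(ys, params):
    """ys: true points; params: one (pi, mu, sigma, rho) tuple per sample."""
    return -sum(math.log(mixture_density(y, *p)) for y, p in zip(ys, params))

# One sample, one standard-normal component evaluated at its own mean.
single = ([1.0], [(0.0, 0.0)], [(1.0, 1.0)], [0.0])
nll = negative_log_likelihood([(0.0, 0.0)], [single])
```

In practice the inner sum is usually evaluated with a log-sum-exp formulation, because the plain density can underflow to zero for points far from every component.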
5 Experiments
In this section, we formalize our problem setting and clarify the effectiveness of our proposed model.
5.1 Problem setting
This study explores CMDN performance from two perspectives.
Our first experiment is to predict the geographic coordinates from which each user posted, using only the content of a single tweet. We evaluate the mean and median of the distances, measured using Vincenty’s formula Vincenty (1975), between the estimated and true geographic coordinates over the whole dataset.
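The evaluation metric can be sketched as follows. Note the substitution: to keep the example short, this sketch uses the haversine great-circle formula on a sphere rather than Vincenty's ellipsoidal formula used in the paper, so distances can differ by a fraction of a percent. The function names and coordinates are hypothetical.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(p, q):
    """Great-circle distance between two (latitude, longitude) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def error_summary(predicted, actual):
    """Mean and median error distance (km) over paired predictions and true locations."""
    distances = [haversine_km(p, t) for p, t in zip(predicted, actual)]
    return mean(distances), median(distances)

# Hypothetical predictions and true locations.
predicted = [(35.0, 139.0), (35.0, 139.0), (35.0, 139.0)]
actual = [(35.0, 139.0), (36.0, 139.0), (35.0, 139.0)]
mean_km, median_km = error_summary(predicted, actual)
```

The example also shows why both statistics are reported: a single distant miss raises the mean while the median stays at zero.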
The second experiment is to filter out unreliable estimations quantitatively using a likelihood-based threshold. Each estimated density assigns a likelihood value to every point of the location space. We designate the likelihood values at the respective estimated locations as reliability indicators. Then, we remove the estimations in ascending order of likelihood and calculate both the mean and median values of the remaining ones. If these statistics decrease monotonically, the indicator correctly filters out the unreliable estimations.
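The filtering procedure amounts to keeping only the estimations whose likelihood meets a lower bound and recomputing the error statistics on what remains. The sketch below uses made-up likelihoods and error distances purely for illustration; `threshold_curve` is a hypothetical helper name.

```python
from statistics import mean, median

def threshold_curve(likelihoods, errors_km, bounds):
    """For each lower bound, return (bound, #kept, mean error, median error)."""
    rows = []
    for bound in bounds:
        kept = [e for l, e in zip(likelihoods, errors_km) if l >= bound]
        if kept:  # skip bounds that remove every estimation
            rows.append((bound, len(kept), mean(kept), median(kept)))
    return rows

# Hypothetical per-tweet likelihoods at the predicted point and error distances (km).
likelihoods = [0.001, 0.01, 0.2, 0.5, 2.0, 5.0]
errors_km = [900.0, 400.0, 150.0, 60.0, 12.0, 3.0]

rows = threshold_curve(likelihoods, errors_km, bounds=[0.0, 0.1, 1.0])
for bound, count, mean_err, median_err in rows:
    print(bound, count, round(mean_err, 2), median_err)
```

In this toy data the mean and median both fall as the bound rises, which is the monotone behavior the experiment looks for. The trade-off is coverage: a stricter bound keeps fewer tweets.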
Location estimation by estimated density
In contrast to the regression approach, it is necessary to specify the estimation point from the estimated density of each tweet. In accordance with Bishop (1994), we employ the mode of the estimated density as the estimated location $\hat{y}$. The mode of the probability distribution can be found by numerical optimization, but this is too costly for scalable estimation. As a simple and scalable approximation of the mode of the estimated density, we restrict the search space to the mean values of the mixture components:

$$\hat{y} = \underset{y \in \{\mu_1, \dots, \mu_K\}}{\arg\max} \; \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y \mid \mu_k, \Sigma_k).$$
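This approximation evaluates the mixture density only at the $K$ component means and returns the best one. The following sketch assumes two-dimensional outputs and already converted parameters; the function names and the example mixture are hypothetical.

```python
import math

def bivariate_normal_pdf(y, mu, sigma, rho):
    """Density of a 2-D Gaussian with scales sigma = (s1, s2) and correlation rho."""
    dx = (y[0] - mu[0]) / sigma[0]
    dy = (y[1] - mu[1]) / sigma[1]
    one_minus_r2 = 1.0 - rho * rho
    z = dx * dx - 2.0 * rho * dx * dy + dy * dy
    norm = 2.0 * math.pi * sigma[0] * sigma[1] * math.sqrt(one_minus_r2)
    return math.exp(-z / (2.0 * one_minus_r2)) / norm

def mixture_density(y, pi, mu, sigma, rho):
    return sum(w * bivariate_normal_pdf(y, m, s, r)
               for w, m, s, r in zip(pi, mu, sigma, rho))

def approximate_mode(pi, mu, sigma, rho):
    """Search only the component means and return the one with the highest density."""
    return max(mu, key=lambda m: mixture_density(m, pi, mu, sigma, rho))

# Hypothetical mixture: a sharp low-weight component and a broad high-weight one.
pi = [0.3, 0.7]
mu = [(35.0, 139.0), (34.0, 135.0)]
sigma = [(0.1, 0.1), (0.5, 0.5)]
rho = [0.0, 0.0]

y_hat = approximate_mode(pi, mu, sigma, rho)
reliability = mixture_density(y_hat, pi, mu, sigma, rho)
```

Here the sharper component wins even though its mixture weight is smaller, which shows that the search maximizes density rather than simply picking the largest weight. The density at the selected point is the value used as the reliability indicator.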
5.2 Dataset
Our tweet corpus consists of 24,633,478 Japanese tweets posted from July 14, 2011 to July 31, 2012. Our corpus statistics are presented in Table 1. We split our corpus randomly into 20M tweets for training, 2M for development, and 2M for testing.
Table 1: Corpus statistics.
Statistic  Value

# of tweets  24,633,478
# of users  276,248
Average # of words per tweet  16.0
Vocabulary size  351,752
5.3 Comparative models
We evaluate the effectiveness of our proposed model through controlled experiments that replace the model components one by one. We also provide the performance of simple baselines. The following model configurations are presented in Table 2.
Mean: Mean value of the training data locations.
Median: Median value of the training data locations.
Enet: Elastic Net regression Zou and Hastie (2005), which consists of ordinary least squares regression with $\ell_1$ and $\ell_2$ regularization.
MLPl1: Multi Layer Perceptron with $\ell_1$ loss.
CNNl2: Convolutional Neural Network for regression with $\ell_2$ loss, based on Kim (2014).
CNNl1: Convolutional Neural Network for regression with $\ell_1$ loss.
MDN: Mode value of Mixture Density Network Bishop (1994)
CMDN: Mode value of Convolutional Mixture Density Network (Proposed)
5.4 Results
5.4.1 Geolocation performance
The experimental geolocation performance results are presented in Table 3.
Overall, our proposed model CMDN provides the lowest median error distance (10.7 km), whereas CNNl1 provides the lowest mean error distance (147.5 km). CMDN gives a mean error distance similar to that of CNNl1 (159.4 km). Both CMDN and CNNl1 outperform all the other comparative models.
Regarding the feature extraction part, the CNN-based models consistently achieved better prediction performance than the corresponding vanilla MLP-based models in both the mean and the median.
Method  Mean (km)  Median (km) 

CMDN  159.4  10.7 
CNNl1  147.5  28.2 
CNNl2  166.5  80.5 
MDN  251.4  92.9 
MLPl1  224.0  75.0 
MLPl2  226.8  140.3 
Enet  197.2  144.6 
Mean  279.4  173.9 
Median  253.7  96.1 
5.4.2 Likelihood based threshold
We show how the likelihood-based threshold affects both the mean and median statistics in Figure 3. Both statistics decrease consistently as the likelihood lower bound increases. In particular, these statistics decrease dramatically within a certain range of the likelihood lower bound (Figure 3). Results for several likelihood bounds are presented in Figure 4. Although the model produces many incorrect predictions for the overall dataset (blue), the likelihood-based threshold correctly prunes the outlier estimations.
6 Discussion
Our proposed model provides high accuracy for our experimental data. Moreover, the likelihood-based threshold provides a consistent indicator of estimation certainty. In this section, we further explore the properties of the CMDN from the viewpoint of the difference between loss functions.
Loss function difference
We compare the CNN-based models with different loss functions: the $\ell_2$ and $\ell_1$ losses for regression (CNNl2 and CNNl1) and the negative log-likelihood for the mixture density model (CMDN). The output distance distributions are shown in Figure 5.
Although the $\ell_2$ loss shows the worst performance in both mean and median among the CNN-based models (Table 3), it yields the lowest outlier ratio. Consequently, the $\ell_2$ loss gives the most conservative estimate for our skewed corpus.
We can infer that the difference between the competitive model CNNl1 and our proposed model CMDN lies in their estimation aggressiveness. The median of CMDN is remarkably lower than that of CNNl1, but the mean of CMDN is slightly larger than that of CNNl1. The reason for these phenomena is the difference in how the loss functions behave for data with multiple candidates. Even though the $\ell_1$ loss function is robust to outliers, the estimation deteriorates when the candidates appear with similar probabilities. In contrast, CMDN can accommodate multiple candidates as multiple mixture components. Therefore, CMDN assigns appropriate probabilities to several candidates and picks the most probable point as the prediction. In short, CMDN’s predictions are more aggressive than CNNl1’s. Consequently, CMDN’s median value is lower than CNNl1’s.
An important shortcoming of aggressive prediction is that the error becomes large when the estimation fails. The mean value tends to be affected strongly by such outlier estimations. Therefore, CMDN’s mean value is higher than that of CNNl1.
However, CMDN can overcome this shortcoming using a likelihood-based threshold, which first filters out the outliers. Therefore, we conclude that the negative log-likelihood for the mixture density is better than the other loss functions, the $\ell_2$ and $\ell_1$ losses for regression.
7 Future work
This study assessed the performance of our proposed model, CMDN, using text data alone. Although text-only estimation is readily applicable to existing resources, there is still room to improve the prediction performance. The winners of the Twitter Geolocation Prediction Shared Task, Miura et al. (2016), proposed a unified architecture that handles several kinds of metadata, such as the user location, user description, and time zone, for predicting geolocations. The CMDN can integrate this information in the same manner.
Furthermore, Liu and Huang (2016) report that the user home location strongly affects location prediction for a single tweet. For example, a routine tweet is fundamentally unpredictable using text contents alone, but if the user home location is known, this information is a valuable indication for evaluating the tweet. As future work, we plan to develop a unified architecture that incorporates user movement information using a recurrent neural network.
In contrast, our objective function might no longer be useful for world-scale geolocation because it approximates spherical coordinates by a real coordinate space. This approximation error tends to become larger for larger-scale geolocation inference. We will explore our method’s geolocation performance using a world-scale geolocation dataset such as the WNUT data Han et al. (2016).
8 Conclusion
This study clarified the capabilities of the density estimation approach to Twitter geolocation. Our proposed model, CMDN, not only achieved high accuracy on our experimental data but also extracted reliably geolocated tweets using likelihood-based thresholds. Results show that CMDN requires only the tweet message contents to identify the geolocation, obviating the preparation of meta-information. Consequently, CMDN can contribute to extending the fields in which geographic information can be applied.
References
 Backstrom et al. (2010) Lars Backstrom, Eric Sun, and Cameron Marlow. 2010. Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th international conference on World wide web. ACM, pages 61–70.
 Bishop (1994) Christopher Bishop. 1994. Mixture density networks. Technical report.
 Broniatowski et al. (2013) David A. Broniatowski, Michael J. Paul, and Mark Dredze. 2013. National and local influenza surveillance through twitter: An analysis of the 2012-2013 influenza epidemic. PLoS ONE 8(12). https://doi.org/10.1371/journal.pone.0083672.
 Caldarelli et al. (2014) Guido Caldarelli, Alessandro Chessa, Fabio Pammolli, Gabriele Pompa, Michelangelo Puliga, Massimo Riccaboni, and Gianni Riotta. 2014. A multilevel geographical study of italian political elections from twitter data. PLoS ONE 9(5):1–11. https://doi.org/10.1371/journal.pone.0095809.
 Cheng et al. (2010) Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: A content-based approach to geolocating twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, New York, NY, USA, CIKM ’10, pages 759–768. https://doi.org/10.1145/1871437.1871535.
 Dredze et al. (2016) Mark Dredze, Miles Osborne, and Prabhanjan Kambadur. 2016. Geolocation for twitter: Timing matters. In North American Chapter of the Association for Computational Linguistics (NAACL).
 Efron (1979) B Efron. 1979. Bootstrap methods: Another look at the jackknife. The Annals of Statistics pages 1–26.

 Fukushima (1980) Kunihiko Fukushima. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics 36(4):193–202.
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Aistats. volume 9, pages 249–256.
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Aistats. volume 15, page 275.
 Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 .

 Han et al. (2014) Bo Han, Paul Cook, and Timothy Baldwin. 2014. Text-based twitter user geolocation prediction. Journal of Artificial Intelligence Research 49:451–500. https://doi.org/10.1613/jair.4200.
 Han et al. (2016) Bo Han, AI Hugo, Afshin Rahimi, Leon Derczynski, and Timothy Baldwin. 2016. Twitter geolocation prediction shared task of the 2016 workshop on noisy user-generated text. WNUT 2016 page 213.
 Jurgens et al. (2015) David Jurgens, Tyler Finethy, James McCorriston, Yi Tian Xu, and Derek Ruths. 2015. Geolocation prediction in twitter using social networks: A critical analysis and review of current practice. In ICWSM. pages 188–197.

 Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1746–1751. http://www.aclweb.org/anthology/D141181.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 Lieberman and Lin (2009) Michael D Lieberman and Jimmy J Lin. 2009. You are where you edit: Locating wikipedia contributors through edit histories. In ICWSM.
 Liu and Huang (2016) Zhi Liu and Yan Huang. 2016. Where are you tweeting?: A context and user movement based approach. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, New York, NY, USA, CIKM ’16, pages 1949–1952. https://doi.org/10.1145/2983323.2983881.
 Minsky and Papert (1969) Marvin Minsky and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, USA.
 Miura et al. (2016) Yasuhide Miura, Motoki Taniguchi, Tomoki Taniguchi, and Tomoko Ohkuma. 2016. A simple scalable neural networks based model for geolocation prediction in twitter. WNUT 2016 9026924:235.

 Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10). pages 807–814.
 Pavalanathan and Eisenstein (2015) Umashanthi Pavalanathan and Jacob Eisenstein. 2015. Confounds and consequences in geotagged twitter data. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP). http://www.aclweb.org/anthology/D/D15/D151256.pdf.
 Priedhorsky et al. (2014) Reid Priedhorsky, Aron Culotta, and Sara Y Del Valle. 2014. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. ACM, pages 1523–1536.
 Rahimi et al. (2015) Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. 2015. Exploiting text and network context for geolocation of social media users. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1362–1367. http://www.aclweb.org/anthology/N151153.
 Sakaki et al. (2010) Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web. ACM, New York, NY, USA, WWW ’10, pages 851–860. https://doi.org/10.1145/1772690.1772777.
 Serdyukov et al. (2009) Pavel Serdyukov, Vanessa Murdock, and Roelof Van Zwol. 2009. Placing flickr photos on a map. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, pages 484–491.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
 Stigler (1997) Stephen M Stigler. 1997. Regression towards the mean, historically considered. Statistical methods in medical research 6(2):103–114.
 Vincenty (1975) Thaddeus Vincenty. 1975. Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Survey review 23(176):88–93.
 Zou and Hastie (2005) Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301–320.