Over the last ten years, social networks have grown enormously and attracted a massive number of users. Among them, Twitter is one of the most popular, with over 300 million users. On Twitter, users publish short messages of at most 140 characters called tweets, which can be seen by their followers or by the public. Tweets can also be re-published by users who have seen them, a process known as retweeting. In this way, information can spread quickly and widely throughout the whole Twitter network. Twitter can even be considered a human-powered sensing network carrying a wealth of useful information, albeit in an unstructured form. For this reason, automatically mining and extracting meaningful information from the massive amount of Twitter data is of great significance.
A very useful piece of information on Twitter is user location, which enables several applications including event detection, online community analysis, social unrest forecasting and location-based recommendation [7, 8]. As another example, user location information can help online marketers and governments understand trends and patterns ranging from customer and citizen feedback to the mapping of epidemics in concerned geographical areas. In 2009, Twitter enabled a geo-tagging feature, with which users can choose to geo-tag their tweets while posting. However, the majority of tweets are not geo-tagged by the users. Alternatively, a user’s location might be available via their profile data. Nonetheless, not many users disclose their location via their Twitter profile, and the provided information is often unreliable. For example, a user might share vague or non-existent places such as “Everywhere” or “Small town, RW Texas”. This motivates the quest for geolocation algorithms that can automatically analyze and infer the location of Twitter users.
The Twitter geolocation problem can be addressed at two different levels, namely, the tweet level and the user level. The former aims at predicting the location of single tweets, while the latter aims at inferring the location of a user from the data generated by that user. Geolocating single tweets is extremely difficult due to the limited available information. Research on single-tweet geolocation has been conducted [12, 13], but good accuracy can be achieved only under specific constraints, which are normally not applicable in real-life situations. On the other hand, Twitter geolocation at the user level, also referred to as Twitter user geolocation, is more common, with plenty of methods described in the literature. In this paper, we focus on the geolocation problem at the user level instead of the tweet level.
The Twitter user geolocation problem can be formulated under a classification or a regression setting. Under the classification setting, one predicts the location of users in terms of geographical regions, such as countries, states and cities. Under the regression setting, the task is to estimate the exact geocoordinates of the users. Both prediction settings are considered in this paper. It is worth mentioning that we address the regression problem from a classification point of view. To this end, we employ a map partitioning technique to divide the concerned geographical area into small regions corresponding to classes. The exact geocoordinates of Twitter users are then estimated using the classes’ centroids.
In the Twitter user geolocation literature, most of the existing algorithms follow either a content-based approach or a network-based approach. Content-based methods extract information from the textual contents of tweets to predict user locations [11, 15, 16]. Network-based methods, on the other hand, employ connections between users for geolocation [17, 18, 19]. Both approaches have achieved good geolocation accuracy [11, 20].
This paper explores a more generic approach, which inherits the advantages of both content-based and network-based strategies. Our approach leverages recent advances in deep neural networks (i.e., deep learning) and multiview learning. Deep neural networks have proven very effective in many domains, including image classification, machine translation, and compressive sensing. Multiview learning, on the other hand, is an emerging paradigm encompassing methods that learn from examples with multiple representations, and has shown great progress recently [26, 27]. In Twitter user geolocation, the views can be different types of information available on Twitter such as text and metadata, or even features extracted from the tweets themselves.
Our contributions in this work are as follows:
We propose a generic multiview neural network architecture, named multi-entry neural network (MENET), for Twitter user geolocation. MENET is capable of combining multiview features into a unified model to infer users’ location.
We show the effectiveness of map partitioning techniques in Twitter user geolocation, especially Google’s S2 geometry library (https://code.google.com/archive/p/s2-geometry-library/). With these partitioning techniques, we achieve state-of-the-art results on several popular datasets.
We provide a thorough analysis of the importance of the input features and the impact of the partitioning strategies on the performance of MENET.
The remainder of this paper is organized as follows. In Section 2, we review related works. Section 3 describes our method in detail, including the model architecture, feature learning, feature extraction and how we improve our model with the density-driven map partitioning technique. Section 4 describes the performance criteria, the pre-processing procedures and the parameter settings of our method; the results of our experiments are also presented in this section. Finally, we draw conclusions and discuss future work in Section 5.
2 Related Work
Most current approaches for predicting the location of Twitter users are based either on user-generated content or on social ties. The first approach, which has been investigated thoroughly, uses textual features from tweets to build location-predictive models. The latter arises from the observation that a user often interacts with people in nearby areas, and exploits the network connections of users. This section takes a closer look at recently published works for both approaches.
The work in  considers tweets and locations as the outputs of a generative process incorporating topics and regions as latent variables, thus geolocating users by seeking to recover these variables. An alternative approach uses geographical Gaussian Mixture Models (GMMs) to model the distribution of the terms of tweets across geographical areas. By calculating a weighted sum of the GMMs corresponding to the terms of a tweet, a geographical density function can be found, revealing the location at the single-tweet level. A similar approach, also making use of GMMs, is introduced by Chang et al. , where a GMM is fit to the conditional probability of a certain city given a term. Cha et al.  estimate location by exploiting the expressiveness of sparse coding and advances in dictionary learning, obtaining state-of-the-art results on a benchmark dataset named GeoText.
Recently, several methods have addressed the Twitter user geolocation problem using deep learning. For example, Liu and Inkpen  train stacked denoising autoencoders to predict regions, states, and geographical coordinates. These vanilla models obtain quite good results thanks to a pre-training procedure. Such methods, however, do not take into account the natural distribution of Twitter users over the different regions of interest; concretely, the density of Twitter users is much higher in inner-city areas than in the countryside. To exploit this property, grid-based geolocation methods were introduced in [33, 34, 35, 36], where adaptive or uniform grids partition the datasets into geographical cells at different levels. The prediction of geographical coordinates is then converted into a classification problem using the cells as classes, and off-the-shelf classifiers can be applied directly. This strategy is also used in our method, but with a different splitting scheme and a novel model architecture.
Recent works have shown a correlation between the likelihood of friendship between two social network users and the geographical distance between them. Using this correlation, the location of users can be estimated from their friends’ locations. This is the key idea behind the network-based approach. By leveraging social interactions such as bi-directional following (two users following each other to see each other’s latest updates) and bi-directional mentioning (the two-way interaction in which two users have mentioned each other in tweets by typing @username), one can establish graphs of Twitter users on which a label propagation algorithm  or its variants [38, 39] identify the locations of unlabeled users [40, 18, 19, 41]. The network-based approach has several advantages over its content-based counterpart, including language independence. Also, it does not require training, which is a resource-intensive and time-consuming process on big datasets. However, the inherent weakness of this approach is that it cannot propagate labels (locations) to users that are not connected to the graph. As a result, isolated users remain unlabeled.
To address the problem of isolated users in the network-based approach, unified text-and-network methods are proposed in [42, 20], which leverage both the discriminative power of textual information and the representativeness of the users’ graph. In particular, the textual information is used to predict labels for disconnected users before running label propagation algorithms. Additionally, the novelty of the works [42, 20] lies in building a dense undirected graph based on user mentions, which yields a significant improvement in location prediction. Following [42, 20], models combining text, metadata and user network features have been introduced [43, 13]. These models rely on user profile information including the user location, timezone and UTC offset. Such information should be considered unavailable in the Twitter user geolocation context, which is why the three benchmark datasets considered in this paper do not provide Twitter profile information.
Our method does not rely on the Twitter user profile information. It employs a similar graph of Twitter users derived from tweets as in ; however, instead of propagating labels through the graph, our method trains an embedding function to capture the graph’s structure. The graph feature is then integrated with all other features in a neural network architecture. Our architecture is simpler, as it does not require designing a specific sub-architecture for each type of feature as in [43, 13], and is thus easier and less resource-intensive to train.
3 Multi-entry Neural Network for Twitter User Geolocation
In Twitter user geolocation, we wish to predict the location of a user using textual information and metadata, obtained from a corpus of tweets sent by the user, as well as information extracted from the user’s network. Using this information, we predict either the area (i.e., the region) where the user most probably resides, or the exact location of the user by means of geocoordinates. Our method treats this as a classification problem. Concretely, for each considered dataset, we subdivide the Twitter users into discrete geographical regions, which correspond to classes. We define the centroid of a region as the coordinate-wise median of the geocoordinates of all training users in that region. Once a test user is classified to a certain region, we take the centroid of that region as the predicted geocoordinates.
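As a concrete sketch of this step, the following Python snippet computes per-region centroids as coordinate-wise medians (a hypothetical helper with an assumed `(region, lat, lon)` input layout, not the actual pipeline code):

```python
from statistics import median

def region_centroids(train_users):
    """Centroid of each region: the per-coordinate median of the
    geocoordinates of the training users assigned to that region.

    train_users: list of (region_label, latitude, longitude) tuples.
    Returns a dict mapping region_label -> (lat, lon) centroid.
    """
    by_region = {}
    for region, lat, lon in train_users:
        by_region.setdefault(region, []).append((lat, lon))
    return {
        region: (median(lat for lat, _ in pts), median(lon for _, lon in pts))
        for region, pts in by_region.items()
    }

# A classified test user receives the centroid of its predicted
# region as its estimated geocoordinates.
users = [("NY", 40.7, -74.0), ("NY", 40.8, -73.9), ("NY", 42.6, -73.8),
         ("TX", 29.7, -95.4), ("TX", 32.8, -96.8)]
centroids = region_centroids(users)
```

The median is preferred over the mean here because it is robust to the few training users whose coordinates lie far from the bulk of the region.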
We propose a generic neural network model to learn from multiple views of data for Twitter user geolocation. We refer to the proposed model as MENET. The advantage of this model is its capability of concurrently exploiting content-based and network-based features, as well as any other available features. In this work, we realize MENET with different types of features, capturing not only the tweets’ content, but also the user network structure and time information. It is worth mentioning that, except for the time information, all features are extracted from the tweets’ content; hence, MENET works even when tweet metadata is unavailable. Integrating all features into MENET results in a powerful method for geolocation. Combined with the Google S2 map partitioning technique, it achieves state-of-the-art results on several Twitter user geolocation benchmarks. This section presents the MENET model and the different types of employed features in detail.
3.1 Model Architecture
Our MENET architecture is illustrated in Fig. 1. The model leverages different features extracted from the tweets’ content and metadata, each corresponding to one view of the data. In Fig. 1, the features feed into individual branches. Each branch can contain multiple hidden layers, allowing the network to learn higher-order features.
Given multiple views of the input data, a straightforward way to combine them is vector concatenation. Nevertheless, we argue that our architecture is more effective, as simple vector concatenation often does not fully utilize the power of multiple features. In MENET, each view is the input to one network branch, which comprises a number of fully connected hidden layers. In order to learn a non-linear transformation function for each branch, we apply the ReLU activation function after each hidden layer. The ReLU function is efficient for backpropagation and less prone to the vanishing gradient problem than the tanh and sigmoid activation functions; hence, it has been used widely in the deep learning literature [46, 47]. The outputs of these branches are concatenated, forming a combined hidden layer. More fully connected layers can be added after this concatenation layer to gain more nonlinearity (see the Post-combined Hidden Layers component in Fig. 1). Again, ReLU is used to activate these layers. At the end, we employ a softmax layer to obtain the output probabilities.
We employ the cross-entropy loss as the objective function. Let $N$ be the number of examples and $C$ the number of classes; the cross-entropy loss is then defined by
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log \hat{y}_{ij}, \qquad (1)$$
where $\mathbf{y}_i$, $i = 1, \dots, N$, is the ground-truth one-hot vector of user $i$ and $\hat{\mathbf{y}}_i$ is the corresponding predicted probability vector; namely, $\hat{y}_{ij}$ is the probability that user $i$ resides in region $j$.
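Written out in plain Python, the loss takes the following form (a minimal sketch for clarity; the actual training operates on batched tensors):

```python
from math import log

def cross_entropy(y_true, y_pred):
    """Cross-entropy loss over N examples and C classes.

    y_true: list of one-hot ground-truth vectors (N x C).
    y_pred: list of predicted probability vectors (N x C),
            e.g. softmax outputs of the network.
    """
    n = len(y_true)
    return -sum(
        t * log(p)
        for truth, pred in zip(y_true, y_pred)
        for t, p in zip(truth, pred)
        if t > 0  # only the true class contributes for one-hot targets
    ) / n

# Two users, three candidate regions.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy(y_true, y_pred)
```

The loss decreases as the probability mass assigned to each user's true region grows.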
3.1.2 Training MENET
We train MENET using the stochastic gradient descent (SGD) algorithm, which optimizes the objective function in (1). In order to avoid overfitting, we use weight regularization and early stopping. The regularization adds a term to the objective function that penalizes weights with large absolute values. Even though it is common practice to regularize the weights in all layers, we empirically found that regularizing only the final output layer still effectively avoids overfitting without affecting the model’s capacity; this eventually results in better classification results.
The parameters of MENET are fine-tuned using a separate set of examples, namely the development set. During training, the classification accuracy of the model on the development set is continuously monitored. If this metric does not improve for a pre-defined number of consecutive steps, the training process is stopped. Using the same mechanism, the learning rate is annealed as training proceeds.
3.1.3 Testing MENET
To predict the locations of users in the test set, we use the trained MENET model to classify them into the pre-defined classes (regions). The exact geocoordinates of a user are given by the centroid of the respective region. The performance of the MENET model is measured either by the accuracy, in the case of regional classification, or by distance error metrics (see Section 4.2), in the case of geographical coordinate prediction.
3.2 Multiview Features
Figure 1 shows the capability of MENET to exploit data from multiple sources. In the context of Twitter user geolocation, we realize MENET by leveraging features from textual information (Term Frequency - Inverse Document Frequency, doc2vec), the user interaction network (node2vec) and metadata (timestamps). These features are all extracted from tweets, provided they are available. The rest of this section describes these features and how they are computed.
3.2.1 The Term Frequency - Inverse Document Frequency Feature
The Term Frequency - Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate how important a term is to a document in a collection or corpus. The importance increases proportionally to the number of times the term appears in the document but is offset by the frequency of the term in the corpus. TF-IDF is composed of two components, presented next.
Term Frequency (TF): It measures how often a term occurs in a document. The simplest choice is the raw frequency of the term in the document:
$$\mathrm{tf}(t, d) = f_{t,d}, \qquad (2)$$
where $f_{t,d}$ is the frequency of term $t$ in the document $d$.
Inverse Document Frequency (IDF): It measures how informative a term is across documents. Concretely, a term common to multiple documents is given a low weight, while a rare term receives a higher weight. The IDF is defined as
$$\mathrm{idf}(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}, \qquad (3)$$
with $D$ denoting the whole set of documents.
Then, the TF-IDF is defined by:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D). \qquad (4)$$
The output of (4) is normalized with the $\ell_2$ norm to have unit length. In fact, there are many variants of the TF-IDF definition, and the choice depends on the specific situation. We use formulations (2) and (3) following the existing implementation in the well-established library scikit-learn (http://scikit-learn.org/stable/).
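The computation in (2)-(4) can be sketched from scratch as follows (a simplified illustration with the unsmoothed IDF; the experiments use scikit-learn's TfidfVectorizer, whose default variant additionally smooths the IDF and normalizes the output):

```python
from math import log

def tf(term, doc):
    # Raw frequency of `term` in the tokenized document `doc` (Eq. 2).
    return doc.count(term)

def idf(term, docs):
    # log of (number of documents / documents containing the term) (Eq. 3).
    df = sum(1 for d in docs if term in d)
    return log(len(docs) / df)

def tfidf(term, doc, docs):
    # Product of term frequency and inverse document frequency (Eq. 4).
    return tf(term, doc) * idf(term, docs)

docs = [
    ["pizza", "in", "brooklyn", "tonight"],
    ["traffic", "in", "austin"],
    ["brooklyn", "bridge", "views"],
]
# "in" appears in 2 of 3 documents, so it is down-weighted relative
# to the rarer, more location-indicative term "austin".
score_in = tfidf("in", docs[1], docs)
score_austin = tfidf("austin", docs[1], docs)
```

For geolocation, this weighting is exactly what is wanted: common stop words are suppressed, while rare, region-specific terms receive high weights.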
3.2.2 The Context Feature
The context feature is a mapping from a variable-length block of text (e.g., a sentence, paragraph, or entire document) to a fixed-length, continuous-valued vector. It provides a numerical representation capturing the context of the document. Originally proposed in , the context feature is also referred to as doc2vec or the Distributed Representation of Sentences, and it is an extension of the broadly used word2vec model .
The intuition behind doc2vec is that a certain context is more likely to produce some sets of words than others. Doc2vec trains an embedding capable of expressing the relation between a context and the corresponding words. To achieve this goal, it employs a simple neural network architecture consisting of one hidden layer without an activation function. A text window samples some nearby words in a document; some of these words are used as inputs to the network and some as outputs. Moreover, an additional input representing the document itself is added to the network, bringing in the document’s context. The training process is completely unsupervised. After training, the fixed representation of the document input captures the context of the whole document. Two architectures were proposed in  to learn a document’s representation, namely, the Distributed Bag of Words (PV-DBOW) and the Distributed Memory (PV-DM) versions of the Paragraph Vector. Although PV-DBOW is the simpler architecture, it has been claimed to perform robustly when trained on large datasets . Therefore, we select the PV-DBOW model to extract the context feature.
3.2.3 The Node2vec Feature
Node2vec is a method proposed in  to learn continuous feature representations (embeddings) for the nodes of a graph. The low-dimensional feature vector represents the network neighborhood of a node. Let $V$ be the set of nodes of a graph. Node2vec learns a mapping function $f: V \to \mathbb{R}^d$ that captures the connectivity patterns observed in the graph. Here, $d$ is a parameter specifying the dimensionality of the feature representation, and $f$ can be viewed as a matrix of size $|V| \times d$. For every source node $u \in V$, a set of neighborhood nodes $N_S(u)$ is generated through a neighborhood sampling strategy $S$. Then, $f$ is obtained by maximizing the log-probability of observing the neighborhood $N_S(u)$ given the embedding $f(u)$, that is,
$$\max_{f} \sum_{u \in V} \log \Pr\big(N_S(u) \mid f(u)\big). \qquad (5)$$
Node2vec employs a sampling method referred to as a biased random walk, which samples the next node $x$ from the neighborhood of the current node $v$ according to discrete transition probabilities. These probabilities depend on the distance between the previous node $t$ and the candidate next node $x$. Denote by $d_{tx}$ the distance, in number of edges, from node $t$ to node $x$: if the next node coincides with the previous node, then $d_{tx} = 0$; if the next node has a direct connection to the previous node, then $d_{tx} = 1$; and if the next node is not connected to the previous node, then $d_{tx} = 2$. The (unnormalized) transition probabilities are defined as follows:
$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}, \qquad \alpha_{pq}(t, x) = \begin{cases} 1/p & \text{if } d_{tx} = 0, \\ 1 & \text{if } d_{tx} = 1, \\ 1/q & \text{if } d_{tx} = 2, \end{cases} \qquad (6)$$
where $w_{vx}$ is the weight of edge $(v, x)$, and $p$ and $q$ are positive parameters. The random walk sampling is run from every node to obtain a list of walks. The node embeddings are then found from the set of walks using a stochastic gradient descent procedure.
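The biased sampling step can be sketched as follows (a toy illustration of the walk bias, not the full node2vec implementation; the function and variable names are our own):

```python
def alpha(p, q, d_tx):
    """Search bias of the biased random walk; d_tx is the edge distance
    between the previous node t and the candidate next node x."""
    if d_tx == 0:       # candidate steps back to the previous node
        return 1.0 / p
    if d_tx == 1:       # candidate is a common neighbor of t and v
        return 1.0
    return 1.0 / q      # candidate moves further away from t

def transition_probs(p, q, candidates):
    """Normalized transition probabilities for the candidate next nodes.

    candidates: list of (d_tx, edge_weight) pairs for the neighbors
    of the current node v.
    """
    weights = [alpha(p, q, d) * w for d, w in candidates]
    total = sum(weights)
    return [w / total for w in weights]

# With p large and q small, the walk rarely backtracks and tends to
# move outward (DFS-like exploration of the graph).
probs = transition_probs(p=4.0, q=0.5,
                         candidates=[(0, 1.0), (1, 1.0), (2, 1.0)])
```

Tuning p and q thus interpolates between breadth-first-like and depth-first-like neighborhood sampling.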
In the context of Twitter user geolocation, each node corresponds to a user, while an edge is a connection between two users. These connections can be defined by several criteria depending on the availability of data. For example, we may consider two users connected when actions such as following, mentioning or retweeting are detected. In this paper, the content of the tweet messages is used to build the graph connections. Similar to , we construct an undirected user graph from mention connections. First, we create a unique set containing all the users of interest. If a user directly mentions another user and both of them belong to this set, we create an edge reflecting this interaction; the edge is assigned a weight equal to the number of mentions. To avoid sparsity of the connections, if two users of interest mention a third user who does not belong to the set, we also create an edge between these two users; the weight of this edge is the sum of the mentions between the third user and the two others. Furthermore, we define a list of so-called celebrities, consisting of users whose number of unique connections exceeds a threshold. We remove all connections to these celebrities, since celebrities are often mentioned by people all over the world; mentioning a celebrity, therefore, is not a good indication of geographical relation. The graph building procedure is depicted in Fig. 2.
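The graph construction just described can be sketched as follows (a simplified illustration; the names, the input layout and the exact weighting of third-party links are our own assumptions, not the paper's implementation):

```python
from collections import defaultdict

def build_mention_graph(mentions, users_of_interest, celebrity_threshold):
    """Undirected, weighted user graph from mention pairs.

    mentions: list of (author, mentioned_user) pairs extracted from tweets.
    Edges between two users of interest are weighted by mention counts;
    two users of interest mentioning the same outside user are also linked.
    Users with more unique neighbors than celebrity_threshold are removed.
    """
    edges = defaultdict(int)
    outside = defaultdict(list)  # outside user -> mentioning users of interest
    for author, target in mentions:
        if author in users_of_interest and target in users_of_interest \
                and author != target:
            edges[frozenset((author, target))] += 1
        elif author in users_of_interest:
            outside[target].append(author)
    # Link users of interest through common outside users.
    for authors in outside.values():
        for i, a in enumerate(authors):
            for b in authors[i + 1:]:
                if a != b:
                    edges[frozenset((a, b))] += 1
    # Remove celebrities: nodes with too many unique neighbors.
    neighbors = defaultdict(set)
    for pair in edges:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)
    celebrities = {u for u, nbrs in neighbors.items()
                   if len(nbrs) > celebrity_threshold}
    return {pair: w for pair, w in edges.items() if not (pair & celebrities)}

graph = build_mention_graph(
    [("a", "b"), ("a", "b"), ("a", "x"), ("c", "x")],
    users_of_interest={"a", "b", "c"},
    celebrity_threshold=10,
)
```

In the example, users "a" and "c" become connected through the outside user "x", illustrating how the third-party rule densifies the graph.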
A shortcoming of this method is that it can only produce an embedding for a node that has at least one connection to another node; nodes without edges cannot be represented. Therefore, for an isolated node, we use an all-zero vector as its embedding. Moreover, whenever a new node joins the graph, the algorithm needs to be run again to learn feature vectors for all the nodes, making our method inherently transductive. There are existing efforts addressing this problem. In , the authors model a node’s embedding as a function of its natural features; in our case, the embedding could be a function of either the TF-IDF or the doc2vec feature. A similar approach presented in  generates a node’s embedding by sampling and aggregating features from the node’s local neighborhood. These inductive approaches will be considered in our future work.
3.2.4 The Timestamp Feature
In many commonly used Twitter datasets such as GeoText  and UTGeo2011 , the posting time of each tweet is available as a UTC (Coordinated Universal Time) value. This allows us to leverage another view of the data. In , it was shown that there exists a correlation between time and place in a Twitter data stream. In fact, people are less likely to tweet late at night than at any other time, which implies a drift in longitude; the timestamp can therefore indicate a time zone. We obtain the timestamp feature for a given user as follows. First, we extract the timestamps from all the tweets of that user and convert them to a standard format to extract the hour value. Then, a 24-dimensional vector is created, corresponding to the 24 hours of a day; the $h$-th element of this vector equals the number of messages posted by the user during the $h$-th hour. This feature is normalized to a unit vector before being fed to our neural network model.
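The extraction of this feature can be sketched as follows (a minimal illustration assuming POSIX timestamps in seconds; the datasets store timestamps in their own formats):

```python
from datetime import datetime, timezone

def timestamp_feature(utc_timestamps):
    """24-dimensional hour-of-day histogram of a user's tweets,
    normalized to unit Euclidean length.

    utc_timestamps: iterable of POSIX timestamps (seconds, UTC).
    """
    hist = [0.0] * 24
    for ts in utc_timestamps:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        hist[hour] += 1.0
    norm = sum(v * v for v in hist) ** 0.5
    return [v / norm for v in hist] if norm > 0 else hist

# Three tweets around 14:00 UTC and one around 03:00 UTC.
feat = timestamp_feature([14 * 3600, 14 * 3600 + 60,
                          14 * 3600 + 120, 3 * 3600])
```

The resulting vector peaks at the user's habitual posting hours, which correlates with the user's time zone and hence with longitude.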
3.3 Improvements with S2 adaptive grid
When the prediction of users’ locations is addressed as a classification problem, the geographical coordinates assigned to a user with unknown location equal the centroid of the class predicted for that user. A straightforward way to form the classes is to take administrative boundaries such as states, regions or countries. Such an approach leads to large distance errors if the respective areas are large. Intuitively, the prediction accuracy can be improved by increasing the granularity, that is, by defining classes that correspond to smaller areas. The tiling should also account for the distribution of users; very imbalanced custom classes should be avoided, otherwise the training process will not be efficient. Therefore, finding an appropriate way to subdivide users into small custom geographical areas is critical.
An early work by Roller et al.  built an adaptive grid using a k-d tree to partition the data into custom classes. Although this partitioning considers the distribution of users, it does not necessarily produce uniform cells at the same level. Here, we split the Twitter users in the training set into small areas called S2 cells, using Google’s S2 geometry library, a powerful tool for partitioning the earth’s surface. Considering the earth as a sphere, the library hierarchically subdivides the sphere’s surface by projecting it onto an enclosing cube. On each face of the cube, a hierarchical partition is made using a spatial data structure called a quad-tree. Each node of the tree represents an S2 cell, which corresponds to an area of the earth’s surface. The root cell is assigned the lowest level, zero, and the leaf cells are assigned the highest level. The library outputs mostly uniform cells at the same level; for instance, the minimum and maximum areas of level-12 cells differ by only a small factor.
In this work, we build an S2 adaptive grid aiming at a balanced tiling, meaning that the defined cells (geographical areas) contain similar numbers of users. To this end, we specify a threshold, the maximum allowed number of users per cell. We build the adaptive grid from the bottom up. First, we identify the leaf cells corresponding to the given geocoordinates. As long as the total number of users in sibling (children) cells is smaller than the threshold, we merge these cells, assigning their users to the parent cell, i.e., a larger geographical area. We climb the tree gradually, repeating this process. If we reach a specified minimum level, we stop the climb in order to avoid defining cells that correspond to large geographical areas; otherwise, the prediction error would increase. Figures 3 and 4 show the subdivision of users into S2 cells for the considered datasets.
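The bottom-up merging can be sketched with a simplified cell representation (hypothetical digit-string tokens standing in for hierarchical cell IDs; the real implementation uses the S2 library's cell hierarchy and its 6-face, 4-way subdivision):

```python
from collections import Counter

def adaptive_grid(leaf_cells, max_users, min_level):
    """Bottom-up merge of hierarchical cells into balanced classes.

    leaf_cells: one cell token per training user; a token is a string of
    digits where each digit descends one level of the tree, so a prefix
    of length L identifies the ancestor cell at level L.
    Sibling cells are merged into their parent while the merged cell
    holds fewer than `max_users` users; merging stops at `min_level`.
    Returns the set of cell tokens used as classes.
    """
    counts = Counter(leaf_cells)
    level = max(len(t) for t in counts)
    while level > min_level:
        # User count each parent cell would hold after merging.
        parents = Counter()
        for token, n in counts.items():
            parents[token[:level - 1] if len(token) == level else token] += n
        merged = Counter()
        for token, n in counts.items():
            parent = token[:level - 1] if len(token) == level else token
            # Merge into the parent only while it stays under budget.
            if parents[parent] < max_users:
                merged[parent] += n
            else:
                merged[token] += n
        counts = merged
        level -= 1
    return set(counts)

cells = ["000", "001", "010", "011", "100", "101", "110", "111"]
classes = adaptive_grid(cells, max_users=3, min_level=1)
```

In the example, the eight level-3 cells merge into four level-2 cells of two users each, but stop there because merging further would exceed the three-user budget.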
4.1 Datasets
GeoText: This is a small dataset containing tweets posted by users from the contiguous US states and Washington D.C. during the first week of March 2010. The tweets were filtered carefully before being put into the dataset, to make sure that only relevant tweets were kept. In this dataset, the geospatial coordinates of a user’s first message are used as their primary location. This was done originally in  and followed by other authors [33, 42]. The dataset is already split into training, development and testing sets. For the downstream tasks, the tweets of a user are concatenated into a single tweet document.
UTGeo2011: This larger dataset was created by the authors of  and is also referred to as TwitterUS in many Twitter user geolocation publications [42, 20, 36]. The dataset contains millions of tweets sent by users from the US. In contrast to GeoText, this dataset is noisier: many tweets have no location information. To treat it similarly to GeoText, all the tweets of a user are concatenated into a single document, and a primary location is defined as the earliest valid coordinate among the user’s tweets. Ten thousand users are selected randomly for the development set, and the same number is reserved for the evaluation set; the remaining users form the training set.
TwitterWorld: This dataset was created by the authors of . It contains millions of tweets sent by users from different countries around the world, of which ten thousand users are kept for each of the development and testing sets. Only tweets that are in English and close to a city are retained. The primary location of a user in this dataset is the centre of the city from which most of their tweets were sent. Unlike GeoText and UTGeo2011, this dataset provides purely textual information; the timestamps of the messages are not available.
The location of a user is indicated by a pair of real numbers, namely, latitude and longitude. However, classification models need discrete labels. For the datasets collected from the US, we follow [31, 33] and employ administrative boundaries to create the class labels. By doing so, we can consider the tasks of regional and state classification as in [31, 15, 11]. We rely on the Ray Casting algorithm of  to decide whether a location lies inside a region or state boundary. For the region and state boundaries, we use information from the Census Divisions (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf). In addition, we employ the Google S2 library, k-means and k-d tree clustering to partition all the geospatial datasets, creating additional sets of labels; these support the task of predicting geocoordinates. More details on the settings of the partitioning schemes and their impact follow in the next section.
4.2 Performance Criteria and Experiment Design
The proposed model for the geolocation of Twitter users addresses the following tasks: (i) four-way classification of US regions, namely Northeast, Midwest, West and South, (ii) fifty-way classification to predict the states of users, and (iii) estimation of the real-valued coordinates of users, i.e., latitude and longitude. For the region and state classification tasks, we compare the performance of our model with existing methods using the percentage of correctly classified users, i.e., the accuracy. For the estimation of user coordinates, we measure the distance between the predicted and the actual geocoordinates and calculate the mean and median values over the testing dataset. The distance between the predicted and the ground-truth coordinates is computed using the Haversine formula . Another common way to measure the success of coordinate estimation is to calculate the percentage of estimations with an error smaller than 161 km (100 miles); this metric, known as Acc@161, has been used in many works [33, 34, 35, 42, 20, 15]. It is worth noting that for the classification accuracy and the Acc@161 metric, higher values indicate better prediction; conversely, lower values of the mean and median distance errors are desired.
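The distance computation and the reported metrics can be sketched as follows (a straightforward implementation of the Haversine formula; the mean earth radius of 6371 km is our assumption):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean earth radius (assumed)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def geolocation_metrics(predicted, actual):
    """Mean error, median error and Acc@161 over paired (lat, lon) lists."""
    errs = sorted(haversine_km(p[0], p[1], a[0], a[1])
                  for p, a in zip(predicted, actual))
    n = len(errs)
    med = errs[n // 2] if n % 2 else (errs[n // 2 - 1] + errs[n // 2]) / 2
    acc161 = sum(e < 161 for e in errs) / n
    return sum(errs) / n, med, acc161

mean_err, med_err, acc = geolocation_metrics(
    predicted=[(0.0, 0.0), (0.0, 0.0)],
    actual=[(0.0, 1.0), (0.0, 5.0)],
)
```

One degree of longitude at the equator corresponds to roughly 111 km, so the first prediction counts toward Acc@161 while the second does not.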
Concerning the first two classification tasks, we conduct experiments on the US Twitter datasets, namely GeoText and UTGeo2011. For predicting Twitter users’ geocoordinates, experiments are performed on all three datasets. Furthermore, it should be noted that the experiments for geographical coordinate prediction use different sets of labels, created by S2, k-d tree and k-means partitioning. Also, the administrative boundaries used in task (ii) are exploited for exact geocoordinate estimation.
4.3 Data Pre-processing and Normalization
We first tokenize the tweets using nltk , a dedicated library for natural language processing. Then, we replace URLs and punctuation with special characters, which reduces the size of the vocabulary without affecting the semantics of the tweets. Finally, nltk is also used for stemming in the last stage of pre-processing.
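A regex-only sketch of the URL/punctuation replacement step; the placeholder token `<url>` and the exact patterns are assumptions of this sketch (the paper relies on nltk for tokenization and stemming):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"[^\w\s@#<>]")  # keep @mentions, #hashtags, <tokens>

def preprocess(tweet):
    """Replace URLs with a special token, strip punctuation, lowercase.

    Tokenization here is plain whitespace splitting; the paper uses nltk.
    """
    tweet = URL_RE.sub("<url>", tweet)
    tweet = PUNCT_RE.sub(" ", tweet)
    return tweet.lower().split()

preprocess("Check this out: https://t.co/abc, amazing!!!")
# → ['check', 'this', 'out', '<url>', 'amazing']
```

Collapsing all URLs into one token keeps the vocabulary compact while preserving the signal that a URL was present.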
Normalization is a common pre-processing step before applying machine learning algorithms. Data can be normalized by removing the mean and dividing by the standard deviation. Alternatively, samples can be scaled into a small range such as [0, 1] or [-1, 1]. A less common way is to scale each sample so that its norm equals 1, also known as unit-norm normalization. In our case, the TF-IDF, node embedding and context features are already scaled to the range [0, 1]. We therefore apply normalization to the timestamp feature only.
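The first of these options, standardization, can be sketched as follows (the input values are toy numbers, not actual timestamp encodings):

```python
def zscore(values):
    """Standardize a feature: subtract the mean, divide by the
    (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5
    return [(v - mean) / std for v in values]

# Applied to the timestamp feature only; the TF-IDF, node embedding and
# context features are already in [0, 1].
zscore([1.0, 2.0, 3.0])  # → [-1.2247..., 0.0, 1.2247...]
```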
4.4 Parameter Settings
Our framework considers four different features and each feature requires some parameters for extraction. Extracting TF-IDF using scikit-learn requires a minimum term frequency across documents min_df. For the GeoText dataset, we choose min_df=. For the UTGeo2011 and TwitterWorld datasets, because of the sheer volume of data, we set min_df= and min_df=, respectively. Concerning doc2vec, we select an embedding size equal to . The size of the sampling window is set to .
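The effect of the min_df cut-off is to discard terms that occur in too few documents. A pure-Python sketch of TF-IDF with such a cut-off (the paper uses scikit-learn's TfidfVectorizer, whose weighting differs in details; the documents and min_df value below are illustrative):

```python
import math
from collections import Counter

def tfidf(docs, min_df=2):
    """Toy TF-IDF: drop terms whose document frequency is below min_df,
    then weight raw term counts by a smoothed inverse document frequency.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    idf = {t: math.log(n / df[t]) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors
```

A larger min_df shrinks the vocabulary (and the feature dimension), which is why the threshold is raised on the larger UTGeo2011 and TwitterWorld datasets.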
We have built the Twitter user graphs for the three datasets using only mentions extracted from tweet messages, as discussed in Section 3.2.3. Following , we set the celebrity connection thresholds to , and for GeoText, UTGeo2011 and TwitterWorld, respectively. Table I shows graph statistics for all three datasets. We use the code provided by the authors of  to obtain the node2vec feature. We choose an embedding size equal to . When training the embeddings, we select the weighted graph option, which takes into account the weights of the edges. The other parameters are set to their default values, namely the walk length , transition parameters , . The sampling window size is set to .
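The mention-graph construction can be sketched as follows; the regex, the edge-weight convention (number of mention occurrences) and the celebrity threshold value are placeholders of this sketch, as the paper tunes the threshold per dataset:

```python
import re
from collections import defaultdict

MENTION_RE = re.compile(r"@(\w+)")

def build_mention_graph(tweets_by_user, celebrity_threshold=5):
    """Undirected weighted user graph from @-mentions in tweets.

    Accounts connected to more than `celebrity_threshold` distinct users
    ("celebrities") are removed, mirroring the paper's filtering step.
    """
    edges = defaultdict(int)
    mentioned_by = defaultdict(set)
    for user, tweets in tweets_by_user.items():
        for tweet in tweets:
            for mentioned in MENTION_RE.findall(tweet):
                if mentioned != user:
                    a, b = sorted((user, mentioned))
                    edges[(a, b)] += 1
                    mentioned_by[mentioned].add(user)
    celebs = {u for u, nbrs in mentioned_by.items()
              if len(nbrs) > celebrity_threshold}
    return {e: w for e, w in edges.items() if not (set(e) & celebs)}

build_mention_graph({"alice": ["hi @bob"], "bob": ["yo @alice @alice"]})
# → {('alice', 'bob'): 3}
```

The resulting weighted edge list is what a node2vec implementation consumes when the weighted-graph option is enabled.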
[Table II: hyperparameter settings for the region/state classification and coordinates prediction tasks. The entries are the numbers of neurons in the hidden layers for the TF-IDF, doc2vec, node2vec, and timestamp features, respectively.]
Choosing the right hyperparameters for neural networks, such as the number and size of hidden layers, is always a challenge. In our experiments, these parameters are set empirically. We set the number of hidden layers in each individual branch to 1, i.e., we use hidden layers , , and for the TF-IDF, node2vec, doc2vec, and timestamp features, respectively. Also, we connect the combination layer directly to the softmax layer, without adding any layer in between. All hyperparameters can be found in Table II. We use a small learning rate and regularize only the weights right before the output layer. The regularization parameter is set to . Training is performed using stochastic gradient descent with the ADAM optimization algorithm  as the update rule. The early-stopping threshold on consecutively non-improving performance is set to for GeoText and to for both the UTGeo2011 and TwitterWorld datasets.
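The multi-branch forward pass (one hidden layer per feature, concatenation, softmax) can be sketched with NumPy; all layer sizes and the weight initialization below are placeholders, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Per-feature input dimensions (illustrative values only).
dims = {"tfidf": 50, "doc2vec": 30, "node2vec": 30, "timestamp": 2}
hidden = {name: 16 for name in dims}  # one hidden layer per branch
n_classes = 4  # e.g. the four US regions

# One hidden-layer weight matrix per branch, plus a random input sample.
W1 = {n: rng.normal(0, 0.1, (hidden[n], d)) for n, d in dims.items()}
x = {n: rng.normal(0, 1, d) for n, d in dims.items()}

# Branch outputs are concatenated in the combination layer, which feeds
# the softmax output layer directly (no layer in between).
h = np.concatenate([relu(W1[n] @ x[n]) for n in dims])
W2 = rng.normal(0, 0.1, (n_classes, h.size))
p = softmax(W2 @ h)  # class probabilities over regions
```

Removing one branch before the concatenation, as done in the feature analysis later, amounts to dropping its entry from `dims`.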
Creating S2 grids requires setting the minimum cell level and the maximum number of users per cell . We have experimented with different settings and report the best results in Table IV, with , for GeoText, , for UTGeo2011, and , for TwitterWorld.
After experimenting with different parameters, normalization techniques and feature combination strategies, we report here the best results obtained. Table III presents the results for regional and state geolocation on GeoText and UTGeo2011, while the results for the prediction of user geographical coordinates are presented in Table IV.
Concerning the classification tasks, our model significantly outperforms all previous works. Successful regional classification is achieved for of the users, while for state classification the result is . By leveraging the classification strength of multiple features, the improvement in regional accuracy is compared to the work in . Concerning the state classification accuracy, we achieve an even greater improvement of compared to the state of the art presented in .
| Method | GeoText Region (%) | GeoText State (%) | UTGeo2011 Region (%) | UTGeo2011 State (%) |
|---|---|---|---|---|
| Eisenstein et al.  | 58 | 27 | N/A | N/A |
| Cha et al.  | 67 | 41 | N/A | N/A |
| Liu & Inkpen  | 61.1 | 34.8 | N/A | N/A |
| Method | GeoText Mean (km) | GeoText Median (km) | GeoText @161 (%) | UTGeo2011 Mean (km) | UTGeo2011 Median (km) | UTGeo2011 @161 (%) | TwitterWorld Mean (km) | TwitterWorld Median (km) | TwitterWorld @161 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Eisenstein et al.  | 900 | 494 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Wing et al. (2011)  | 967 | 479 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Roller et al.  | 897 | 432 | 35.9 | 860 | 463 | 34.6 | N/A | N/A | N/A |
| Wing & Baldridge (Uniform)  | N/A | N/A | N/A | 703.6 | 170.5 | 49.2 | 1714.6 | 490 | 32.7 |
| Wing & Baldridge (KD tree)  | N/A | N/A | N/A | 686.6 | 191.4 | 48.0 | 1669.6 | 509.1 | 31.3 |
| Melo et al.  | N/A | N/A | N/A | 702 | 208 | N/A | 1507 | 502 | N/A |
| Liu & Inkpen  | 855.9 | N/A | N/A | 733 | 377 | 24.2 | N/A | N/A | N/A |
| Cha et al.  | 581 | 425 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Rahimi et al. (2015)  | 581 | 57 | 59 | 529 | 78 | 60 | 1403 | 111 | 53 |
| Rahimi et al. (2017)  | 578 | 61 | 59 | 515 | 77 | 61 | 1280 | 104 | 53 |
| MENET with state labels | 570 | 58 | 59.1 | 474 | 157 | 50.5 | N/A | N/A | N/A |
| MENET with S2 labels | 532 | 32 | 62.3 | 433 | 45 | 66.2 | 1044 | 118 | 53.3 |
The estimation of the geographical coordinates of Twitter users involves experiments with two types of labels, hence two sets of experiments. In the first set, we use classes corresponding to the fifty states of the US. In the second set, we employ the S2 classes described in Section 3. As can be seen in Table IV, for the results obtained with state labels, the mean distance error of MENET is smaller than that of the other methods. Likewise, the median distance error and the @161 accuracy are better on GeoText. However, our results for these metrics are worse on UTGeo2011, because the state boundaries ignore the geographical distribution of users. The performance of MENET improves significantly on all criteria with S2 labels, where the definition of the regions takes the distribution of users into account. In this case, Table IV shows that the proposed method outperforms existing methods in terms of mean distance error, median distance error and @161 accuracy on GeoText and UTGeo2011. On TwitterWorld, the median distance error is reduced by more than compared to the result in , while the results for the other metrics are comparable to the state of the art. At this point, we would like to underline that the number of employed classes is critical for the performance of our method. A larger number of classes results in smaller geographical areas, which may improve the geocoordinate prediction. However, training a model with more classes may be more difficult, and thus the classification may perform worse.
4.5.1 Granularity Analysis
[Table V: performance of MENET on GeoText for label sets created by varying the minimum S2 cell level — Region count, Mean (km), Median (km), @161 (%).]
[Table VI: performance of MENET on GeoText for label sets created by varying the maximum number of users per cell — Region count, Mean (km), Median (km), @161 (%).]
As explained in Section 3.3, an S2 adaptive grid is built using two parameters: the minimum S2 cell level and the maximum number of users per cell (i.e., per region or class) . As an example, the geolocation results with S2 labels presented in Table IV for GeoText are associated with a minimum cell level of and a user threshold of . The number of cells and their areas vary depending on these parameters. One may wonder whether this setting is optimal. In this section, we analyze the performance of MENET under different S2 parameter settings. Concretely, we run experiments on GeoText using the same MENET hyperparameter settings but different S2 label sets, created by varying either the minimum S2 cell level or the user threshold . The results of these experiments are shown in Tables V and VI.
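The adaptive splitting can be illustrated with a flat-plane quadtree as a stand-in for S2 (S2 subdivides cells on the sphere; the bounding box, thresholds and points below are toy values):

```python
def adaptive_grid(points, bbox, min_level=1, max_users=3, level=0):
    """Recursively split a cell into four while it is above min_level
    or holds more than max_users points; each surviving cell becomes
    one class label. A depth cap avoids infinite splits on coincident
    points."""
    lat0, lat1, lon0, lon1 = bbox
    inside = [(la, lo) for la, lo in points
              if lat0 <= la < lat1 and lon0 <= lo < lon1]
    if not inside:
        return []
    if (level >= min_level and len(inside) <= max_users) or level >= 20:
        return [(bbox, inside)]
    mid_lat = (lat0 + lat1) / 2
    mid_lon = (lon0 + lon1) / 2
    cells = []
    for sub in [(lat0, mid_lat, lon0, mid_lon), (lat0, mid_lat, mid_lon, lon1),
                (mid_lat, lat1, lon0, mid_lon), (mid_lat, lat1, mid_lon, lon1)]:
        cells += adaptive_grid(inside, sub, min_level, max_users, level + 1)
    return cells

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 9)]
adaptive_grid(pts, (0, 10, 0, 10))  # two dense corners → two cells
```

Raising the minimum level yields smaller cells everywhere; lowering the user threshold forces deeper splits only where users are dense, which is exactly the behaviour analyzed in Tables V and VI.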
We can see a clear trend in the median distance error in the experiments with varying . When increases, meaning that more regions are generated, the median distance error decreases monotonically to a very small value (i.e., km). The reason is intuitive: S2 cells at a higher level cover a smaller area, so if the classification performance of MENET does not degrade significantly with more classes, a predicted location is likely to lie closer to the ground-truth location. This also explains the increasing trend in the accuracy within 161 km. There is no clear trend in the mean distance error, which can be explained by the sensitivity of the mean to outliers: even if the classification accuracy of MENET drops only slightly, misclassifications into large-area cells can produce huge distance errors, which strongly affect the mean value. The impact of such outliers on the median, in contrast, is small.
Table VI shows that when the maximum number of users per cell increases, fewer regions are created. The decreasing trend in the mean distance error can be explained by the better classification performance obtained with fewer classes. Moreover, the median and the @161 accuracy remain stable within the range of for . The reason is that the classification accuracy does not change significantly in this range. The median and the @161 accuracy are, however, much worse with set to , even though the corresponding number of regions is limited. The reason, again, is that the classification performance drops dramatically. Why is the classification accuracy so low? Because splitting with this setting ignores the natural geographical distribution of the data. In fact, an S2 cell at level fits well the area of cities, where most of the tweets originate. If we lower the user threshold per cell, the splitting algorithm will stop at much higher cell levels in cities, where the tweet density is high, thus dividing the city area into multiple smaller regions. This explains why the classification performance is so low. Figures 5 and 6 show the subdivision of GeoText at level with different values of .
4.5.2 Feature Analysis
The MENET architecture has the capability of exploiting multiple features according to the multiview learning paradigm. In this paper, we realize the model using four features: TF-IDF, node2vec, doc2vec and timestamp. The question, then, arises: which feature contributes the most to the discriminative strength of the model? To answer this question, we conduct additional experiments with different combinations of features. Concretely, we eliminate one type of feature from the feature set and perform experiments with the rest. This can be done by temporarily removing a branch in MENET just before the concatenation layer (see Fig. 1). For a fair comparison, we use the same parameter setting for MENET as in the experiments with the full feature set. The results from the experiments on the GeoText dataset are presented in Table VII.
Compared to the results in Table IV, it is clear that the node2vec feature is the most important: removing it leads to a significant drop in MENET's performance in terms of mean distance error ( km), median distance error ( km) and accuracy within 161 km ( %). The contribution of the doc2vec feature is also noticeable, as its removal increases the mean distance error by more than km compared to the full feature set. The remaining features improve the performance slightly, since removing them results in only a marginal degradation of the three performance criteria.
[Table VIII: geolocation results on GeoText per label type — Mean (km), Median (km), @161 (%).]
4.5.3 Performance of MENET with regard to Other Partitioning Schemes
In Table IV, we observe a notable improvement in the geolocation results of MENET when using S2 labels. Google's S2 geometry library is, however, only one of many ways to create label sets for our classification problem. A partitioning strategy similar to Google's S2 library is the Hierarchical Equal Area isoLatitude Pixelization of a sphere (HEALPix) . Like the S2 library, it partitions the sphere into a uniform grid, and it has appeared in several papers on Twitter user geolocation such as . Other examples include the use of the k-d tree  and k-means  clustering algorithms for grouping users, and thus creating labels, as in [20, 42]. In this section, we investigate the performance of MENET with respect to two label creation strategies, namely, k-d tree and k-means subdivisions. Our experiments are again conducted on the GeoText dataset.
Following , we create groups of users using either k-d tree or k-means partitioning. The clustering of users is based on their geographical coordinates, namely, latitude and longitude. For the k-d tree subdivision, we create the root node from the bounding box that contains all user coordinates. The tree is then built by recursively splitting nodes, which correspond to boxes, into child nodes with straight dividing lines. The splitting considers the larger dimension of a node and divides the users in that node into two even groups. Note that only the leaves store users, and these correspond to classes; therefore, the dividing lines must not pass through any user's point. The recursive splitting stops when the number of users in a cell falls below a given threshold. Following , we set the threshold to , resulting in geographical cells (i.e., classes). When using k-means to create the classes, the number of clusters is set to , and the Euclidean distance metric is used. The same hyperparameter settings are kept for MENET in these experiments.
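A minimal sketch of the described k-d tree subdivision; the `min_size` threshold and the coordinates used below are illustrative, not the paper's setting:

```python
def kdtree_partition(points, min_size=4):
    """Recursive k-d tree split over (lat, lon) points: split along the
    node's larger dimension, dividing users into two even groups; stop
    when a leaf holds fewer than min_size users. Each leaf is a class."""
    if len(points) < min_size:
        return [points]
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    # Split along the dimension with the larger spread, as in the text.
    axis = 0 if (max(lats) - min(lats)) >= (max(lons) - min(lons)) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2  # even split; no line passes through a point
    return kdtree_partition(pts[:mid], min_size) + \
           kdtree_partition(pts[mid:], min_size)

leaves = kdtree_partition([(float(i), float(i)) for i in range(10)])
# 10 users → 4 leaves, each below the min_size threshold
```

Unlike the S2 grid, the leaf boundaries here are driven purely by the data distribution, with no fixed cell hierarchy.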
The geolocation results on the GeoText dataset with k-d tree and k-means partitionings are shown in Table VIII. Clearly, k-means performs better than the k-d tree in partitioning Twitter users, in the sense that it mitigates the geolocation errors. Concretely, using k-means labels reduces the mean distance error by more than km and the median distance error by %, while the accuracy within 161 km improves by roughly %. On the other hand, the performance of MENET using S2 labels is better under all considered performance criteria. It is also worth mentioning that the performance of MENET with k-means labels is close to that with S2 labels. However, S2 partitioning offers more flexible control of the median distance error and is more stable in creating labels than k-means.
5 Conclusion and Future Work
Noisy and sparse labeled data make the prediction of Twitter user locations a challenging task. While plenty of approaches have been proposed, no method has attained a very high accuracy. Following the multiview learning paradigm, this paper shows the effectiveness of combining knowledge from both user-generated content and network-based relationships. In particular, we propose a generic neural network model, referred to as MENET, that uses words, paragraph semantics, network topology and timestamp information to infer users' locations. The proposed model provides more accurate results than the state of the art, and it can be extended to leverage other types of available information beyond the types of data considered in this paper.
The performance of our model depends heavily on the user graph features. The node2vec algorithm used in this paper is transductive, meaning that the graph is built over all users. In future work, we will focus on making the model truly inductive, i.e., able to generalize to previously unseen users.
-  Statista. (2017, Nov) Number of monthly active twitter users worldwide. [Online]. Available: https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
-  S. S. Minab and M. Jalali, “Online analyzing of texts in social network of twitter,” in International Congress on Technology, Communication and Knowledge, 2014, pp. 1–6.
-  A. Sechelea, T. H. Do, E. Zimos, and N. Deligiannis, “Twitter data clustering and visualization,” in International Conference on Telecommunications, 2016, pp. 1–5.
-  T. Sakaki, M. Okazaki, and Y. Matsuo, “Tweet analysis for real-time event detection and earthquake reporting system development,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 919–931, 2013.
-  M. Komorowski, T. H. Do, and N. Deligiannis, “Twitter data analysis for studying communities of practice in the media industry,” Telematics and Informatics, 2017.
-  R. Compton, C. Lee, T. Lu, L. D. Silva, and M. Macy, “Detecting future social unrest in unprocessed twitter data: emerging phenomena and big data,” in IEEE International Conference on Intelligence and Security Informatics, 2013, pp. 56–60.
-  J. Bao, Y. Zheng, D. Wilkie, and M. Mokbel, “Recommendations in location-based social networks: a survey,” GeoInformatica, vol. 19, no. 3, pp. 525–565, 2015.
-  G. Zhao, X. Qian, and C. Kang, “Service rating prediction by exploring social mobile users’ geographical locations,” IEEE Transactions on Big Data, vol. 3, no. 1, pp. 67–78, 2017.
-  E. Celikten, G. L. Falher, and M. Mathioudakis, “Modeling urban behavior by mining geotagged social data,” IEEE Transactions on Big Data, vol. 3, no. 2, pp. 220–233, 2017.
-  X. Ji, S. A. Chun, and J. Geller, “Monitoring public health concerns using twitter sentiment classifications,” in IEEE International Conference on Healthcare Informatics, 2013, pp. 335–344.
-  M. Cha, Y. Gwon, and H. T. Kung, “Twitter geolocation and regional classification via sparse coding,” in International AAAI Conference on Web and Social Media, 2015, pp. 582–585.
-  N. T. Duong, N. Schilling, and L. S. Thieme, “Near real-time geolocation prediction in twitter streams via matrix factorization based regression,” in International Conference on Information and Knowledge Management, 2016, pp. 1973–1976.
-  J. H. Lau, L. Chi, K. N. Tran, and T. Cohn, “End-to-end network for twitter geolocation prediction and hashing,” in International Joint Conference on Natural Language Processing, 2017.
-  D. Jurgens, T. Finethy, J. McCorriston, Y. T. Xu, and D. Ruths, “Geolocation prediction in twitter using social networks: A critical analysis and review of current practice,” in International Conference on Web and Social Media, 2015, pp. 188–197.
-  J. Liu and D. Inkpen, “Estimating user location in social media with stacked denoising auto-encoders.” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 201–210.
-  R. Priedhorsky, A. Culotta, and S. Y. D. Valle, “Inferring the origin locations of tweets with quantitative confidence,” in ACM conference on Computer supported Cooperative Work & Social Computing, 2014, pp. 1523–1536.
-  L. Backstrom, E. Sun, and C. Marlow, “Find me if you can: improving geographical prediction with social and spatial proximity,” in International Conference on World Wide Web, 2010, pp. 61–70.
-  D. Jurgens, “That’s what friends are for: Inferring location in online social media platforms based on social relationships.” in International AAAI Conference on Weblogs and Social Media, 2013, pp. 273–282.
-  R. Compton, D. Jurgens, and D. Allen, “Geotagging one hundred million twitter accounts with total variation minimization,” in IEEE International Conference on Big Data, 2014, pp. 393–401.
-  A. Rahimi, T. Cohn, and T. Baldwin, “A neural model for user geolocation and lexical dialectology,” in Meeting of the Association for Computational Linguistics, 2017, pp. 209–216.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
-  T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba, “Addressing the rare word problem in neural machine translation,” in Meeting of the Association for Computational Linguistics and the Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015, pp. 11–19.
-  D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Deep learning sparse ternary projections for compressed sensing of images,” in IEEE Global Conference on Signal and Information Processing [Available: arXiv:1708.08311], 2017.
-  J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
-  L. Zhang, L. Zhang, D. Tao, and X. Huang, “On combining multiple features for hyperspectral remote sensing image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 3, pp. 879–893, 2012.
-  J. Yu, D. Liu, D. Tao, and H. S. Seah, “On combining multiple features for cartoon character retrieval and clip synthesis,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 42, no. 5, pp. 1413–1427, 2012.
-  Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in International Conference on Machine Learning, 2014, pp. 1188–1196.
-  A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
-  L. Hong, A. Ahmed, S. Gurumurthy, A. J. Smola, and K. Tsioutsiouliklis, “Discovering geographical topics in the twitter stream,” in International Conference on World Wide Web, 2012, pp. 769–778.
-  J. Eisenstein, B. O’Connor, N. A. Smith, and E. P. Xing, “A latent variable model for geographic lexical variation,” in Conference on Empirical Methods in Natural Language Processing, 2010, pp. 1277–1287.
-  H. Chang, D. Lee, M. Eltaher, and J. Lee, “@ phillies tweeting from philly? predicting twitter user locations with spatial word usage,” in International Conference on Advances in Social Networks Analysis and Mining, 2012, pp. 111–118.
-  S. Roller, M. Speriosu, S. Rallapalli, B. Wing, and J. Baldridge, “Supervised text-based geolocation using language models on an adaptive grid,” in Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1500–1510.
-  B. P. Wing and J. Baldridge, “Simple supervised document geolocation with geodesic grids,” in Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 955–964.
-  B. Wing and J. Baldridge, “Hierarchical discriminative classification for text-based geolocation.” in Conference on Empirical Methods in Natural Language Processing, 2014, pp. 336–348.
-  F. Melo and B. Martins, “Geocoding textual documents through the usage of hierarchical classifiers,” in Workshop on Geographic Information Retrieval, 2015, pp. 7:1–7:9.
-  U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E, vol. 76, no. 3, p. 036106, 2007.
-  S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly, “Video suggestion and discovery for youtube: taking random walks through the view graph,” in International Conference on World Wide Web, 2008, pp. 895–904.
-  P. P. Talukdar and K. Crammer, “New regularized algorithms for transductive learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2009, pp. 442–457.
-  C. A. Davis Jr., G. L. Pappa, D. R. R. de Oliveira, and F. L. Arcanjo, “Inferring the location of twitter messages based on user relationships,” Transactions in GIS, vol. 15, no. 6, pp. 735–751, 2011.
-  S. Apreleva and A. Cantarero, “Predicting the location of users on twitter from low density graphs,” in IEEE International Conference on Big Data, 2015, pp. 976–983.
-  A. Rahimi, T. Cohn, and T. Baldwin, “Twitter user geolocation using a unified text and network prediction model,” in Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 2015, pp. 630–636.
-  Y. Miura, M. Taniguchi, T. Taniguchi, and T. Ohkuma, “Unifying text, metadata, and user network representations with a neural network for geolocation prediction,” in Meeting of the Association for Computational Linguistics, 2017, pp. 1260–1272.
-  X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
-  Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in International Conference on Machine Learning, vol. 30, no. 1, 2013.
-  X. Pan and V. Srikumar, “Expressiveness of rectifier networks,” in International Conference on Machine Learning, 2016, pp. 2427–2435.
-  A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, C. Suen, A. Coates, A. Maas, A. Hannun, B. Huval, T. Wang, and S. Tandon. (2013) Unsupervised feature learning and deep learning. [Online]. Available: http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/
-  L. Bottou, “Stochastic gradient descent tricks,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 421–436.
-  J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2011.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  J. H. Lau and T. Baldwin, “An empirical evaluation of doc2vec with practical insights into document embedding generation,” in Workshop on Representation Learning for NLP, 2016.
-  R. Rehurek and P. Sojka, “Software framework for topic modelling with large corpora,” in Workshop on New Challenges for NLP Frameworks, 2010.
-  Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in International Conference on Machine Learning, 2016, pp. 40–48.
-  W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017.
-  M. Dredze, M. Osborne, and P. Kambadur, “Geolocation for twitter: Timing matters.” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1064–1069.
-  B. Han, P. Cook, and T. Baldwin, “Geolocation prediction in social media data by finding location indicative words,” in International Conference on Computational Linguistics, 12 2012, pp. 1045–1062.
-  M. Shimrat, “Algorithm 112: position of point relative to polygon,” Communications of the ACM, vol. 5, no. 8, p. 434, 1962.
-  R. W. Sinnott, “Virtues of the haversine,” Sky and Telescope, vol. 68, p. 158, 1984.
-  S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st ed. O’Reilly Media, Inc., 2009.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2014.
-  K. M. Górski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann, “HEALPix: a framework for high-resolution discretization and fast analysis of data distributed on the sphere,” The Astrophysical Journal, vol. 622, no. 2, p. 759, 2005.
-  J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
-  J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, 1967, pp. 281–297.