Every day and everywhere, events are reported as text, and many of these reports do not give locations in a standard, hierarchical form. In context-aware text, location is a fundamental component that supports a wide range of applications, so processing massive amounts of text effectively in specific scenarios requires normalizing locations. Because text streams on social media move especially quickly during accident or disaster response Munro (2011), location normalization is crucial for situational awareness in these fields, where a terse writing style often omits content the writer regards as redundant. For example, “十陵立交路段交通拥堵 (Traffic congestion at Shiling Interchange)” refers to a definite location, but nothing in the text says where the Shiling Interchange is, so an exact response is impossible unless we know that it belongs to Longquanyi district, Chengdu city, Sichuan province.
Countries are divided into units so that their land and the affairs of their people are easier to manage. An administrative division (AD) is a portion of a country or other region delineated for the purpose of administration. Because of China’s large population and area, its ADs have consisted of several levels since ancient times. For clarity and convenience, our system covers three levels: we treat the largest administrative divisions of a country as 1st-level and the successive subdivisions as 2nd-level and 3rd-level. In China these correspond to the provincial level (province, autonomous region, municipality, and special administrative region), the prefecture-level city, and the county, as shown in Table 1. China administers more than 3,200 divisions across these flattened levels. In such a large and complex hierarchy, much work stops at extracting the relevant locations, for example by named entity tagging Srihari (2000). There are many similar named entity recognition (NER) toolkits Che et al. (2010); Finkel et al. (2005) for location extraction. Because location names are highly ambiguous, Li et al. (2002) and Al-Olimat et al. (2017) address disambiguation in location extraction. We go a step further: we extract normalized information and determine which three-level hierarchical administrative area a document mainly describes.
The challenges in our location normalization are somewhat different, and lie mainly in ambiguity and explicit absence. For example, both Beijing and Changchun city have a 3rd-level Chaoyang district, and “Chaoyang” also means the rising sun in Chinese, so the name alone is ambiguous. If “Beijing” and “Chaoyang” are mentioned in the same context, we can be confident that “Chaoyang” refers to the district of Beijing city. In a similar spirit, Yarowsky (1995) proposes a corpus-based unsupervised approach that avoids the need for costly hand-labeled training data. However, contexts commonly lack enough co-occurring ADs to disambiguate, or lack explicit information altogether. We refer to this as the explicit absence problem; neither NER nor disambiguation can solve it unless more hidden information is explored. Many specific AD-related clues identify which division is meant, including:
Location alias, e.g. “鹏城 (Pengcheng)” is an alias of Shenzhen city;
Former name or customary title, e.g. “老闸北 (Old Zhabei)” refers to a municipal district that once existed in Shanghai city;
A phrase naming an event tied to a region, e.g. “中国国际徽商大会 (China Huishang Conference)” has been held in Hefei city;
Some POIs (points of interest), e.g. the well-known “颐和园 (Summer Palace)” is situated in the northwestern suburbs of Beijing.
We summarize them under a concept named ROI, which is similar to POI but differs from it in an important way. A POI dataset collects specific location points that someone may find useful or interesting, and maps each to a detailed address that includes its administrative division. However, many POIs have only a uni-directional association with an AD. For example, Bank of China, a common POI, has branches across China. We can find many Bank of China branches within a specific AD, but if only “Bank of China” appears in a context, we cannot determine its location without more area information. Since a POI is inherently uncertain in this sense, we propose the concept of ROI, which has a bi-directional association with an AD: an ROI maps to a fixed hierarchical administrative area, so the ROI represents that area with high confidence and the area definitely contains the ROI. In the absence of explicit patterns, co-occurring ROIs in the context are good evidence for predicting the most likely administrative area. The main contributions of the system, which can also be applied to other languages, are as follows:
We provide a structured AD database, and use the co-occurrence constraints to make a decision;
ROIBase is equipped with geographic embeddings, trained on special location sequences, to make inferences;
We use a large news corpus to build a knowledge base that is made up of ROIs, which helps normalization.
2 User Interface
We design a web-based online demo (http://research.dylra.com/v/roibase/) to show the location normalization. As shown in Figure 1, there are three cases separated by blue lines, and each case contains two main components: query and result.
Query: Enter the document in the textbox with a green border to query ROIBase. The query accepts sentences in Chinese, such as text from news or social media.
Result: After the query is submitted, the structured result from ROIBase is shown to the right of the textbox. The result consists of three parts: Confidence, Inference and ROI.
Confidence represents the result that can be extracted and identified from explicit information. For example, we can confidently fill in “新疆 (Xinjiang)” when “尉犁县 (Yuli County)” and “巴音郭楞蒙古自治州 (Bayingol Mongolian Autonomous Prefecture)” appear together in the context.
Inference complements Confidence by means of embeddings: the nearest administrative level that is still uncertain is inferred from the implicit information in the input. For example, no explicit administrative area appears in the middle case of Figure 1, so Inference starts at the 1st level (the largest division) and infers “广东省 (Guangdong Province)”. If Confidence already provides the 1st level, Inference starts at the 2nd level. If Confidence fills all three levels, Inference does nothing and leaves the result unchanged.
ROI is derived from the ROI knowledge base. We match the input against the ROI knowledge base and, when the match succeeds, return the ROI together with its associated administrative area. ROIs are many and varied in type; what they have in common is that each builds a bi-directional relation with a hierarchical AD. As shown in Figure 1, “梧桐山 (Wutong Mountain)”, the highest peak in Shenzhen city, maps to three levels: [Yantian district, Shenzhen city, Guangdong province].
When the user submits a query, the input is segmented into tokens by a Chinese tokenizer. Two processes then run in parallel: one calculates the Confidence and then the Inference, and the other retrieves from the ROI knowledge base. The final result is structured and returned to the front end, where it is shown in green.
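The query flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: the function names, the toy dictionaries, and the substring-based `tokenize` are all hypothetical stand-ins for the real Chinese tokenizer, the Confidence and Inference modules, and the knowledge-base lookup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data; hierarchies are listed from the 1st level downward.
AD_PARENTS = {"深圳市": ["广东省", "深圳市"]}        # explicit AD name -> hierarchy
ROI_KB = {"梧桐山": ["广东省", "深圳市", "盐田区"]}  # ROI -> full three-level hierarchy


def tokenize(text):
    # Stand-in for a Chinese tokenizer: keep the known names found in the text.
    return [name for name in list(AD_PARENTS) + list(ROI_KB) if name in text]


def confidence_then_inference(tokens):
    # Confidence: the longest hierarchy supported by explicit AD mentions.
    confidence = []
    for tok in tokens:
        if tok in AD_PARENTS and len(AD_PARENTS[tok]) > len(confidence):
            confidence = AD_PARENTS[tok]
    # Inference starts at the first level that Confidence left empty, if any.
    next_level = len(confidence) + 1 if len(confidence) < 3 else None
    return {"confidence": confidence, "inference_starts_at_level": next_level}


def lookup_roi(tokens):
    # Return every matched ROI together with its administrative area.
    return {tok: ROI_KB[tok] for tok in tokens if tok in ROI_KB}


def query(text):
    tokens = tokenize(text)
    # The two branches are independent, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        ad_future = pool.submit(confidence_then_inference, tokens)
        roi_future = pool.submit(lookup_roi, tokens)
        result = ad_future.result()
        result["roi"] = roi_future.result()
    return result
```

Calling `query("梧桐山是深圳市最高峰")` on this toy data fills two levels from the explicit city name, reports that inference would start at the 3rd level, and returns the matched ROI with its three-level area.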
3.1 Administrative Division Co-occurrence Constraint
We maintain an administrative division database that includes the names and some aliases of the administrative areas in China, organized as a hierarchy. Each record is associated with its parent and children. For example, “襄阳市 (Xiangyang city)” is at the 2nd level, its alias is Xiangfan, its parent is Hubei province, and its children include Gucheng County, Xiangzhou District, etc. We develop a co-occurrence constraint based on this database to produce the Confidence result, as shown in Algorithm 1.
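One way to picture such a record is the sketch below. The `ADRecord` class, its field names, and the helper functions are assumptions made for illustration; the actual schema of the database is not given here. For simplicity, the toy assumes that standard names are unique, which does not hold for real data (several districts share the name Chaoyang).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ADRecord:
    name: str
    level: int                          # 1 = provincial, 2 = prefecture-level city, 3 = county
    aliases: List[str] = field(default_factory=list)
    parent: Optional[str] = None        # standard name of the parent division
    children: List[str] = field(default_factory=list)


def build_index(records):
    # Map every standard name and alias to the records it may refer to.
    index = {}
    for rec in records:
        for key in [rec.name] + rec.aliases:
            index.setdefault(key, []).append(rec)
    return index


def hierarchy(rec, by_name):
    # Walk up the parent links and return the path from the 1st level down.
    chain = [rec.name]
    while rec.parent is not None:
        rec = by_name[rec.parent]
        chain.append(rec.name)
    return chain[::-1]


records = [
    ADRecord(name="Hubei province", level=1, children=["Xiangyang city"]),
    ADRecord(name="Xiangyang city", level=2, aliases=["Xiangfan"],
             parent="Hubei province",
             children=["Gucheng County", "Xiangzhou District"]),
    ADRecord(name="Gucheng County", level=3, parent="Xiangyang city"),
    ADRecord(name="Xiangzhou District", level=3, parent="Xiangyang city"),
]
by_name = {rec.name: rec for rec in records}
index = build_index(records)
```

With this index, the alias "Xiangfan" resolves to the Xiangyang record, and `hierarchy(by_name["Gucheng County"], by_name)` returns the path from Hubei province down to the county.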
First, we expand the input segments into candidate AD hierarchies and keep the longest ones for the next calculation. If a sentence is full of assorted AD information, it is probably just a list of addresses that carries no meaning as a whole, such as:
where the underlined words are related to administrative areas. The more varied the area-related words in a sentence are, the less certainty the sentence carries. We consider the frequency of the hits as well as a penalty for other surrounding area-related words, and construct a function that accumulates the weight of each sentence for each AD. Finally, we obtain the Confidence result from these explicit statistics.
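The idea of rewarding hits and penalizing unrelated area words can be sketched as below. This is not Algorithm 1: the weighting formula `hits / (1 + penalty * misses)`, the `expansions` dictionary, and all names are assumptions chosen only to illustrate how co-occurrence can favor one candidate hierarchy over another.

```python
from collections import defaultdict


def is_compatible(path, cand):
    # A path supports a candidate when it is a prefix of (or equal to) it.
    return cand[:len(path)] == path


def confidence_scores(sentences, expansions, penalty=1.0):
    """Accumulate a weight per candidate hierarchy over all sentences.

    sentences:  list of token lists.
    expansions: area word -> list of candidate hierarchies (tuples, 1st level first).
    """
    scores = defaultdict(float)
    for words in sentences:
        area_words = [w for w in words if w in expansions]
        candidates = {c for w in area_words for c in expansions[w]}
        for cand in candidates:
            # Area words that agree with this candidate count as hits.
            hits = sum(
                any(is_compatible(p, cand) for p in expansions[w])
                for w in area_words
            )
            # The remaining area words point elsewhere and act as a penalty.
            misses = len(area_words) - hits
            scores[cand] += hits / (1.0 + penalty * misses)
    return dict(scores)


expansions = {
    "Beijing": [("Beijing",)],
    "Chaoyang": [("Beijing", "Chaoyang"), ("Jilin", "Changchun", "Chaoyang")],
}
scores = confidence_scores([["Beijing", "Chaoyang", "traffic"]], expansions)
best = max(scores, key=scores.get)
```

In this toy input, the hierarchy ending in Beijing's Chaoyang is supported by both area words and receives the highest weight, while the Changchun reading is penalized by the unrelated mention of Beijing. When "Chaoyang" appears alone, both readings tie, which is the case that needs further evidence.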
3.2 Geographic Embeddings
We propose to train geographic embeddings on word sequences related to ADs. Because location information usually makes up only a small part of a document, standard AD names are sparse and scattered, and the words related to geographic locations (hereafter, geographic words) lie in a long tail and are rarely seen. We therefore do not learn embeddings directly from the raw word sequences. Instead, we assume that a raw sequence is made up of records of the AD database, geographic words, and other words. To keep the first two, we pass through a large news corpus of more than 14.3 million documents, take every phrase in a news sentence that hits an AD record as a starting point, use an NER toolkit to recognize the location entities in the two surrounding sentences, and extract, in their original order, candidate sequences that consist of standard AD records and location entities. The NER model is not extremely accurate, so it recognizes various types of location-related phrases generically. We collect the candidate sequences longer than a threshold length to train the geographic embeddings.
Given a set $S$ of candidate sequences extracted from documents, each sequence $s = (x_1, \dots, x_n) \in S$ is made up of AD records and location entities, where the relative order of the elements in $s$ stays the same as in the raw text. The aim is to learn a $d$-dimensional real-valued embedding $v_{x_i} \in \mathbb{R}^d$ of each $x_i$, so that administrative areas and geographic words share the same embedding space and adjacent administrative areas lie nearby in it. We learn the embeddings with the skip-gram model Mikolov et al. (2013) by maximizing the objective function $\mathcal{L}$ over the set $S$, which is defined as follows:

$$\mathcal{L} = \sum_{s \in S} \sum_{x_i \in s} \sum_{-m \le j \le m,\, j \ne 0} \log P(x_{i+j} \mid x_i), \qquad P(x_{i+j} \mid x_i) = \frac{\exp\left(v_{x_i}^{\top} v'_{x_{i+j}}\right)}{\sum_{x \in V} \exp\left(v_{x_i}^{\top} v'_{x}\right)}$$

where $v$ and $v'$ are the input and output vectors, $m$ is the size of the sequence window, and $V$ is the vocabulary that consists of the administrative areas and geographic words.
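To make the role of the sequence window concrete, the sketch below enumerates the (center, context) pairs that a skip-gram objective sums over. The function name, the default window, and the `min_len` threshold are assumptions for illustration; the actual window size and length threshold are not stated.

```python
def skipgram_pairs(sequences, window=2, min_len=3):
    """Return (center, context) pairs from candidate sequences.

    Sequences shorter than min_len are skipped, mirroring the idea of
    keeping only candidate sequences above a threshold length.
    """
    pairs = []
    for seq in sequences:
        if len(seq) < min_len:
            continue
        for i, center in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is never its own context
                    pairs.append((center, seq[j]))
    return pairs


sequences = [
    ["Guangdong province", "Shenzhen city", "Wutong Mountain"],
    ["Beijing"],  # too short, dropped
]
pairs = skipgram_pairs(sequences, window=1)
```

Because AD records and geographic words sit in the same sequences, each pair ties a geographic word to nearby administrative areas, which is what places them in one embedding space.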
To evaluate whether the geographic embeddings capture regional characteristics, we design a visualization. First, we perform k-means clustering on the learned embeddings of the records in the AD database, grouping the 4,000+ standard ADs into 100 clusters. We then plot the points on a map of China with division borders, where different colors represent different clusters and the coordinates are the approximate locations of the standard ADs themselves. As shown in Figure 2, points in the same cluster are mainly located in the same administrative area, which means that geographic similarity is well encoded.
Based on the Confidence result, we use the geographic embeddings trained in the above section to infer the next administrative area. We first take the intersection of the input text $T$ and the set of geographic words $G$, and average the embeddings of the intersection in each dimension to obtain the representation $v_T$ of the input. Then we embed the subdivisions of the last level in Confidence to get the candidate embeddings. For example, if the Confidence ends at the 2nd level, denoted as $a$, the embeddings of its subdivisions can be denoted as $\{v_{a_1}, \dots, v_{a_k}\}$, where $k$ is the number of subdivisions. We observe that the cosine similarity between the correct candidate and the input representation is often higher than that of the other candidates. We therefore make the Inference by $\hat{a} = \arg\max_{1 \le i \le k} \cos(v_T, v_{a_i})$, as the complement of Confidence.
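The inference step can be sketched in a few lines. The function names and the two-dimensional toy vectors are assumptions for illustration; real embeddings would come from the trained model.

```python
import math


def mean_vector(vectors):
    # Average the vectors dimension by dimension.
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def infer_next_level(tokens, embeddings, candidates):
    """Pick the candidate subdivision closest to the averaged input embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    usable = [c for c in candidates if c in embeddings]
    if not known or not usable:
        return None  # nothing to infer from
    text_vec = mean_vector(known)
    return max(usable, key=lambda c: cosine(text_vec, embeddings[c]))


# Toy vectors, invented for this example only.
embeddings = {
    "Wutong Mountain": [0.9, 0.1],
    "Huawei base": [0.8, 0.2],
    "Shenzhen city": [1.0, 0.0],
    "Guangzhou city": [0.0, 1.0],
}
choice = infer_next_level(
    ["Wutong Mountain", "Huawei base", "drone"],
    embeddings,
    ["Shenzhen city", "Guangzhou city"],
)
```

Tokens without an embedding, such as "drone" here, are simply ignored, and the function returns `None` when the input shares no word with the embedding vocabulary.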
Since embeddings are implicit, we build an ROI knowledge base to improve interpretability and reduce the bias of Inference. Unlike traditional taxonomies, which require a lot of manual labor, we propose a novel method that extracts ROIs from a large corpus and uses statistics to model the inconsistent, ambiguous and uncertain information it contains.
Given the set $S$ of geographic sequences from Section 3.2, in which $g$ denotes a geographic word, we assume that the administrative area that appears most frequently in the window of $g$ probably corresponds to its division. In practice, some administrative area records, such as Beijing, Shanghai and other big cities, appear more frequently in general. We therefore count the number of times the pair $(g, a)$ appears in $S$, where $a$ represents an administrative area name, and offset it by the total count of $a$ in the whole corpus. A weighting scheme similar to tf–idf is applied to balance the exact division:

$$w(g, a) = \mathrm{count}(g, a) \cdot \mathrm{idf}(a)$$

where $\mathrm{count}(g, a)$ denotes the counting operation of the co-occurrence of $g$ and $a$ in each geographic sequence, and $\mathrm{idf}(a)$ denotes the inverse document frequency of $a$ over all sequences in $S$.
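A small sketch of this kind of weighting is given below. The exact formula is an assumption: co-occurrence is counted once per sequence, and the inverse document frequency is taken as `log(N / df)`, a common textbook form that may differ from the one actually used.

```python
import math
from collections import Counter


def roi_scores(sequences, ad_names):
    """Score (geographic word, administrative area) pairs tf-idf style."""
    n_seq = len(sequences)
    pair_count = Counter()  # sequences in which g and a co-occur
    ad_df = Counter()       # sequences in which a appears at all
    for seq in sequences:
        ads = {t for t in seq if t in ad_names}
        geo_words = {t for t in seq if t not in ad_names}
        ad_df.update(ads)
        for g in geo_words:
            for a in ads:
                pair_count[(g, a)] += 1
    # Frequent areas get a small idf, which offsets their raw counts.
    return {
        (g, a): count * math.log(n_seq / ad_df[a])
        for (g, a), count in pair_count.items()
    }


sequences = [
    ["Summer Palace", "Beijing", "Haidian"],
    ["Summer Palace", "Beijing"],
    ["Bund", "Shanghai", "Beijing"],
    ["Bund", "Shanghai"],
]
scores = roi_scores(sequences, {"Beijing", "Haidian", "Shanghai"})
```

In this toy corpus "Beijing" occurs in three of four sequences, so its incidental co-occurrence with "Bund" is weighted well below the pair ("Bund", "Shanghai").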
We score each pair of a geographic word $g$ and an administrative area $a$, and filter the valid pairs by a high threshold. A sorted mapping $g \to [(a_1, w_1), (a_2, w_2), \dots]$ is then obtained for each $g$, where $w_i$ denotes the score weight and higher weights rank first. Note that a geographic word is not necessarily an ROI. We use the information entropy to filter the valid candidates:

$$H(g) = -\sum_{i} p_i \log p_i, \qquad p_i = \frac{w_i}{\sum_{j} w_j}$$

If $g$ cannot represent an administrative area, the weights of its candidate mappings will be dispersed. The higher $H(g)$ is, the less certainty the mapping contains. We cut off geographic words with high $H(g)$ to keep the candidate ROIs.
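The entropy filter can be illustrated as follows. Normalizing the weights into a probability distribution and the cut-off value `max_entropy=0.5` are assumptions made for this sketch, as are the toy weights.

```python
import math


def mapping_entropy(weights):
    """Shannon entropy (natural log) of the normalized mapping weights."""
    total = sum(weights)
    if total <= 0:
        return math.inf  # no usable evidence at all
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def keep_roi_candidates(mappings, max_entropy=0.5):
    # Keep geographic words whose weight is concentrated on few divisions.
    return {
        g: m for g, m in mappings.items()
        if mapping_entropy(list(m.values())) <= max_entropy
    }


mappings = {
    "Summer Palace": {"Beijing": 9.0, "Shanghai": 0.3},
    "Bank of China": {"Beijing": 1.0, "Shanghai": 1.0, "Chengdu": 1.0},
}
kept = keep_roi_candidates(mappings)
```

Here "Bank of China" spreads its weight evenly over three cities, so its entropy is high and it is dropped, whereas "Summer Palace" concentrates almost all of its weight on one division and is kept.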
For a specific candidate ROI, the upper level of a mapping commonly has a higher frequency in the news corpus than the lower level. For example, Summer Palace co-occurs with Beijing more often than with Haidian, although Haidian district is a subdivision of Beijing city. We use the subdivision relation to correct the weight of a candidate division whenever another candidate in the same mapping is its parent division:
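where the exclusion operator removes the named element from a set, the probability term measures how likely the candidate ROI is to appear only with the parent division although it actually belongs to the subdivision, the sequence set contains the other sequences of the same document, and the last term is the Heaviside step function.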
We sort the mapping again under the re-weighting scheme and take the top few pairs whose weights are on the same order of magnitude to compose ROI pairs $(g, [a^{(1)}, a^{(2)}, a^{(3)}])$, where $a^{(1)}, a^{(2)}, a^{(3)}$ represent the three levels of AD and a level is set to null if it is missing. Finally, the pairs are inserted into an Elasticsearch (https://www.elastic.co) engine to build the knowledge base.
| ROIBase | NER+pattern | Section of text |
|---|---|---|
| -, 呼和浩特市 (Hohhot city), 内蒙古自治区 (Inner Mongolia Autonomous Region) | 内蒙古 (Inner Mongolia) | 内蒙古大兴安岭原始林区雷击火蔓延… (Lightning fire spreads in the virgin forest area of the Greater Xing’an Mountains, Inner Mongolia…) |
| -, 深圳市 (Shenzhen city), 广东省 (Guangdong province) | - | 日前，华为基地启用了无人机送餐业务… (A few days ago, Huawei base launched drone food delivery business…) |
| 双流区 (Shuangliu district), 成都市 (Chengdu city), 四川省 (Sichuan province) | 海口 (Haikou) | 四川航空3u8751成都至海口航班…安全落地成都双流国际机场… (Sichuan Airlines flight 3u8751 from Chengdu to Haikou returned and landed safely at Chengdu Shuangliu International Airport…) |
| -, 丽江市 (Lijiang city), 云南省 (Yunnan province) | - | 拍不出泸沽湖万分之一的美这个时节少了喧嚣多了闲适 (Can’t capture one ten-thousandth of the beauty of Lugu Lake…) |
| -, 武汉市 (Wuhan city), 湖北省 (Hubei province) | 湖北 (Hubei) | 湖北经济学院学生爆料质疑校园联通宽带垄断性经营 (Students from Hubei University of Economics questioned campus unicom’s …) |
There are no publicly available datasets on text location normalization, and therefore no directly comparable methods. Since many similar schemes for detecting locations start from NER, we build NER+pattern as the baseline, which uses NER to recognize locations and then looks them up in the AD database. We conduct the experiments on news and Weibo (a social media platform in China) corpora. A news item contains a title and content: the title is usually short and cohesive, while the content typically runs to hundreds of words with more location information, where the challenges lie in redundancy and efficiency. The Weibo corpus consists of short texts, in which the location information is often implicit.
We manually sample finance and social news and obtain 760 news articles that can be assigned to a definite place to build the news dataset. Likewise, 1,228 short texts are picked from the Weibo corpus. Location information is extracted from these datasets by ROIBase and by NER Che et al. (2010)+pattern respectively. As the examples in Table 2 show, NER+pattern matching alone cannot use the hidden information to normalize the locations completely. ROIBase contains 1.51 million geographic embeddings and 0.42 million ROIs, so it knows more links to ADs, such as those carried by the underlined phrases.
A variant of the F1 score is used to measure performance, which counts an incomplete output as a 0.5 hit. As shown in Table 3, ROIBase outperforms NER with AD patterns by large margins. Some Weibo texts carry a location label, which helps the recognition of AD patterns and narrows the gap with ROIBase. Long texts provide more abundant information, and ROIBase can eliminate confusion to improve the performance.
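The half-credit idea can be made concrete with the sketch below. The precise definition of the metric is not spelled out, so everything here is an assumed reading: a prediction earns 1 when all three levels match the gold label, 0.5 when it is incomplete but every level it does give is correct, and 0 otherwise; precision is taken over documents with any output and recall over all documents. The labels are placeholders.

```python
def credit(pred, gold):
    """Credit for one document; levels are ordered 1st, 2nd, 3rd, None = missing."""
    given = [(p, g) for p, g in zip(pred, gold) if p is not None]
    if not given or any(p != g for p, g in given):
        return 0.0
    return 1.0 if len(given) == len(gold) else 0.5


def f1_variant(preds, golds):
    total = sum(credit(p, g) for p, g in zip(preds, golds))
    answered = sum(any(x is not None for x in p) for p in preds)
    precision = total / answered if answered else 0.0
    recall = total / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


golds = [("P1", "C1", "D1"), ("P2", "C2", "D2"), ("P3", "C3", "D3")]
preds = [("P1", "C1", "D1"), ("P2", "C2", None), (None, None, None)]
precision, recall, f1 = f1_variant(preds, golds)
```

With one complete hit, one correct but incomplete output, and one empty output, the total credit is 1.5 over two answered documents and three gold labels.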
Running ROIBase over 100 thousand news articles from the financial and social domains gives more detailed statistics. As shown in Table 4, ROIBase normalizes locations for 36.8 percent of the articles overall. Among them, 23 percent are normalized only at the 1st level, 48.7 percent down to the 2nd level, and 28.3 percent with complete divisions. We measure the speed on a machine with a 2.0 GHz Xeon CPU and 4 GB of memory: ROIBase reaches up to 751 KB/s, whereas the NER method Che et al. (2010) runs at 14.4 KB/s. ROIBase thus lets the user process vast amounts of long text in location normalization.
5 Related Work
Qian et al. (2017) formalize location inference on social media as a semi-supervised factor graph model and perform it at the level of countries and provinces. Huang and Carley (2019) present a hierarchical location prediction neural network for user geolocation on Twitter. However, many of these works focus on a single level, cover only a few countries or states, or use extra features beyond text, and there is room for improvement in their performance. Since Mikolov et al. (2013) proposed the word vector technique, it has found many applications. Grbovic and Cheng (2018) introduce listing and user embeddings trained on bookings to capture users’ real-time and long-term interests. Wu et al. (2012) demonstrate that a taxonomy knowledge base can be constructed from the entire web using special patterns. Inspired by these cases, we build the first solution that normalizes the location of text by hierarchical administrative areas.
Through our investigation, we found that there is very little work on location normalization of text, and that popular related solutions, such as NER, are not directly transferable to it. The ROIBase system provides an efficient and interpretable solution to location normalization through a web interface, and processes its modules with a cascaded mechanism. We propose it as a baseline that can be applied easily to different languages, and look forward to more work on improving location normalization.
- Al-Olimat et al. (2017) Hussein S Al-Olimat, Krishnaprasad Thirunarayan, Valerie Shalin, and Amit Sheth. 2017. Location name extraction from targeted text streams using gazetteer-based statistical language models. arXiv preprint arXiv:1708.03105.
- Che et al. (2010) Wanxiang Che, Zhenghua Li, and Ting Liu. 2010. Ltp: A chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 13–16. Association for Computational Linguistics.
- Finkel et al. (2005) Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 363–370. Association for Computational Linguistics.
- Grbovic and Cheng (2018) Mihajlo Grbovic and Haibin Cheng. 2018. Real-time personalization using embeddings for search ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 311–320. ACM.
- Huang and Carley (2019) Binxuan Huang and Kathleen Carley. 2019. A hierarchical location prediction neural network for twitter user geolocation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4731–4741. Association for Computational Linguistics.
- Li et al. (2002) Huifeng Li, Rohini K Srihari, Cheng Niu, and Wei Li. 2002. Location normalization for information extraction. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Munro (2011) Robert Munro. 2011. Subword and spatiotemporal models for identifying actionable information in haitian kreyol. In Proceedings of the fifteenth conference on computational natural language learning, pages 68–77. Association for Computational Linguistics.
- Qian et al. (2017) Yujie Qian, Jie Tang, Zhilin Yang, Binxuan Huang, Wei Wei, and Kathleen M Carley. 2017. A probabilistic framework for location inference from social media. arXiv preprint arXiv:1702.07281.
- Srihari (2000) Rohini Srihari. 2000. A hybrid approach for named entity and sub-type tagging. In Sixth Applied Natural Language Processing Conference, pages 247–254.
- Wu et al. (2012) Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 481–492. ACM.
- Yarowsky (1995) David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pages 189–196.
- Zubiaga et al. (2017) Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, and Adam Tsakalidis. 2017. Towards real-time, country-level location classification of worldwide tweets. IEEE Transactions on Knowledge and Data Engineering, 29(9):2053–2066.