Real-time Automatic Word Segmentation for User-generated Text

10/31/2018 ∙ by Won Ik Cho, et al. ∙ Seoul National University

For readability and possibly for disambiguation, appropriate word segmentation is recommended for written text. In this paper, we propose a real-time assistive technology that utilizes automatic segmentation. The language primarily investigated is Korean, a head-final language with various morpho-syllabic blocks as its character set. The training scheme is fully neural network-based and extensible to other languages, as is implemented in this study for English. In addition, we show how the proposed system can be utilized for web-based fine-tuning on user-generated text. Through a qualitative and quantitative comparison with widely used text processing toolkits, we show the reliability of the proposed system and how it fits conversation-style and non-canonical texts. A demonstration for both languages is freely available online.


1 Introduction

The study on word segmentation (although spacing is more appropriate for Korean, we use segmentation for consistency) has been done for many languages, including Hindi Garg et al. (2010), Arabic Lee et al. (2003), Chinese Chen et al. (2015a), Japanese Sassano (2002), and Korean Lee et al. (2002). The requirement for segmentation depends on the language; e.g., in Chinese and Japanese it is not obligatory. However, in Korean, the spaces between words (the unit for such spacing in Korean is called eojeol (어절); it corresponds to a word in English but differs in that it is a composition of morphemes rather than an inflection) are indispensable for the readability of the text Lee and Kim (2013) and possibly for the naturalness of speech generated from the text.

Figure 1: Instagram of an elderly Korean YouTube star, Makrae Park (image by Visualdive; http://www.visualdive.com/). The non-segmented document lacks readability in some sense, but the fluent use of a mechanical segmentation system can be difficult for digitally marginalized people.

Nonetheless, in real-world language life where input through a device is common, giving a space with an additional keystroke can be an annoying or time-wasting process. For example, in real-time chatting via desktop or mobile keyboards, some users tend to enter non-segmented texts that lack readability and sometimes even cause ambiguity; this stems mainly from the annoyance, but it can also be a challenging issue for digitally marginalized people such as the elderly or the handicapped (Figure 1). We were inspired by users' need for a web-based real-time automatic segmentation that resolves this readability issue.

In this paper, we propose an assistive technology that helps device users assign proper spaces to non-segmented documents. The system differs from correction tools, which evaluate the suitability of the text with respect to the standard language rules and suggest a grammatically appropriate one. From a slightly different perspective, the proposed system assigns proper spaces between the characters of the non-segmented input utterance, in a non-invasive way that avoids distorting the user's intention. Although the resulting segmentation is not strictly correct (in Korean, not strictly obeying the segmentation (spacing) rules is tolerated in informal conversation-style sentences), it provides sufficient readability and can provide naturalness from the perspective of speech production.

Not limited to Korean, the proposed approach is expected to extend to various other languages such as Japanese and Chinese, considering the size of the character set. Moreover, to see if the approach can be applied to a language with a much smaller character set, we show how the system is implemented for English, in terms of hashtag segmentation.

In the following section, we take a look at the literature on word and hashtag segmentation. In Sections 3 and 4, the architecture of the proposed system is described with a detailed implementation scheme for each language. Afterward, the system is evaluated quantitatively and qualitatively with manually created datasets.

2 Related Work

2.1 Automatic word segmentation

Many studies on the automatic segmentation of Korean text have dealt with sequential information on the history of syllables, using statistics-based approaches on the previous words to correctly predict the space that follows each syllabic block Lee et al. (2002, 2014). This issue is different from the studies on morphological analysis Lee (2011), though some approaches utilized the decomposition into morphemes Lee et al. (2002). In a recent implementation of KoSpacing Jeon (2018), a convolutional neural network (CNN) Kim (2014) and a bidirectional gated recurrent unit (BiGRU) Chen et al. (2015b) were employed, utilizing various news articles for training.

For Japanese, which has a head-final syntax similar to Korean, traditional approaches to the segmentation of text and Kanji sequences adopted statistical models Sassano (2002); Ando and Lee (2003). However, the context information does not seem to have been fully utilized so far.

For Chinese word segmentation, many contextual approaches have been suggested, although the embedding methodology and syntax differ from Korean. In Chen et al. (2015a), context information was aggregated via a gated recursive neural network to predict the space between characters. In Cai and Zhao (2016), a gated combination neural network was used over characters to produce distributed representations of word candidates.

2.2 Hashtag segmentation

Before extending the implementation on Korean to the other language, we considered what its English counterpart is. Taking into account the difference coming from the character system, we concluded that it approximately corresponds to the hashtag segmentation that is widely used for various analysis tasks targeting social media text, including Facebook comments, Tweets, and Instagram posts Bansal et al. (2015).

There are mainly four types of hashtags easily observed: (1) a single word, (2) an acronym, (3) a sequence of first-letter-capitalized words, and (4) a sequence of non-capitalized (or all-capitalized) words. The tweets below are described in Cho et al. (2018), originally from Van Hee et al. (2018).
(1) It will be impossible for me to be late if I start to dress up right now. #studing #university #lazy
(2) Another great day on the #TTC.
(3) I picked a great week to start a new show on Netflix. #HellOnWheels
(4) Casper the friendly ghost on 2 levels! #thingsmichelleobamathinksareracist

For (1) and (2), hashtags can be identified given a substantial dictionary. For (3), the hashtag can be easily separated into a sequence of words with tools such as RegEx. The most complicated case is (4), where there is no absolute rule for the segmentation. Ambiguity can even be induced in hashtags such as #nothere or #alwaysahead (suggested in http://slides.com/piyushbansal/twitter-hashtag-segmentation#/5), which can be disambiguated only by introducing the context in which they are contained.
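For case (3) specifically, the RegEx-based split mentioned above can be illustrated with a short Python sketch; the helper name and the pattern are our illustration, not a component of the proposed system:

import re

def split_camel_case_hashtag(tag):
    # split a first-letter-capitalized hashtag such as #HellOnWheels into words
    body = tag.lstrip('#')
    return re.findall(r'[A-Z][a-z0-9]*|[a-z0-9]+', body)

print(split_camel_case_hashtag('#HellOnWheels'))  # ['Hell', 'On', 'Wheels']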

The importance of using hashtag information in social media text analysis was emphasized in Chang (2010), but the study of its segmentation has been boosted since Berardi et al. (2011). Early approaches utilized lexicon-based identification and statistical models such as the Viterbi algorithm Bansal et al. (2015); Baziotis et al. (2017), and these are still widely used due to their speed and accuracy.

3 Proposed System

In this section, referring to the previous studies, we investigate an adequate architecture that generates proper segmentation for non-canonical sentences. The proposed system incorporates three main features as its contribution: (1) a conversation-style and non-normalized/non-standard corpus used for training, (2) a deep learning-based automatic segmentation module, and (3) the potential to be utilized in a web interface applicable to various social media and microblogs.

3.1 Corpus

3.1.1 Korean

The corpus consists of the dialogs in Korean drama scripts which were semi-automatically collected, with punctuation and narration removed. It contains 2M utterances of various lengths and involves a wide variety of topics. The number of distinct characters was found to be around 2,500, which covers almost all the Korean characters that are used in real life. Several characteristics distinguish our dataset from the corpora conventionally used:
(i) Non-standardness: The script is highly conversation-style, and it contains a large portion of dialogs that do not fit standard (Contemporary Seoul) Korean. It also includes dialects, middle-age Korean, and transliterations of foreign languages (e.g., Japanese and Chinese).
(ii) Non-normalizedness: A significant portion of the sentences contain non-normalized expressions; the corpus contains murmurs, new words, slang, incomplete sentences, etc.
(iii) Weak segmentation rules: Word (eojeol) segmentation does not strictly follow standard Korean as in literary language, but the corpus was checked thoroughly for inappropriate spaces. The omission of spaces was tolerated as long as it did not induce awkwardness in the prosody of the spoken sentence.
(iv) Errata allowed: Errata were not corrected if the pronunciation did not affect the semantics of the sentence; this can be a strong point for the word segmentation of user-generated text.

3.1.2 English

For training and test, we employed Twitter data of size 3M that was collected based on geo-location Cheng et al. (2010) (the original dataset contained 7M instances in total, but about half of those were used, for fast training and due to a hardware limitation). With the observation that hashtags are usually made out of the words frequently used in microblogs, the raw tweet texts were utilized. Note that the tweets are not only texts that are easily accessible but also quite conversation-style, incorporating many expressions used in real-time chats or web posts.

Figure 2: The architecture of the automatic segmentation module. The input sentence means "Mother put soup into the bottle", but if it were not for the space flag in the third place of the inference module, the sentence could have meant "I put the skin of mother into the bottle", which is a syntactically possible but awkward output for an ambiguous sentence. n denotes the maximum character length of the input.

Before training, we applied a refinement process to the raw tweets, which contain various non-alphabetical and non-digit symbols. For the refinement, word tokenization was done with a split by space (RegEx); a minimal sketch of this cleanup is given after the list.
(i) Removing ID tags: Since ID tags (@) are usually not related to the semantics of the tweets, and sometimes even contain lexicons that are not split (e.g., @happilyeverafter), we simply removed the tokens that start with @.
(ii) Removing URLs: We noticed that tweets with photos accompany URLs that redirect users to the images. We also removed the tokens that start with 'http', for the same reason as ID tags.
(iii) Removing OOV hashtags: Since our task aims at the segmentation of hashtags, many tags were simply removed. However, we observed some cases where the hashtag symbol (#) is used to emphasize a specific word; in these cases, the word was preserved if it was not out-of-vocabulary (OOV) (the dictionary that is utilized is suggested afterward). If OOV, the token was simply removed, since it can interrupt the accurate prediction of the space information in the training session.
(iv) Numbers and special symbols: In hashtags, only the alphabet letters (upper and lower case), numbers (digits 0-9), and underscores are allowed (https://www.hashtags.org/featured/what-characters-can-a-hashtag-include/). Other special symbols that are used frequently but cannot be included in hashtags ($ % & + - * / = < >) were simply replaced with a placeholder such as '#'. The cases utilizing underscores were handled separately.
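The refinement steps (i)-(iv) can be sketched as follows; the vocabulary set vocab, the helper name, and the exact replacement rules are our assumptions rather than the released preprocessing code:

import re

def refine_tweet(tweet, vocab):
    # tokenization by whitespace, as described above
    refined = []
    for tok in tweet.split(' '):
        if tok.startswith('@') or tok.startswith('http'):
            continue                               # (i), (ii): drop ID tags and URLs
        if tok.startswith('#'):
            word = tok[1:].lower()
            if word in vocab:
                refined.append(word)               # (iii): keep in-vocabulary emphasis tags
            continue                               # drop OOV hashtags
        # (iv): replace symbols not allowed in hashtags with the placeholder '#'
        refined.append(re.sub(r'[^a-z0-9_#]', '#', tok.lower()))
    return ' '.join(refined)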

In summary, we obtained datasets of size 2M and 3M for Korean and English, respectively. Both corpora are depunctuated; the former incorporates over 2,500 characters, and the latter consists of the lowercased alphabet letters (26), digits (10), and a symbol placeholder (#). # was chosen since it does not appear in the middle of any hashtag. Each dataset was split into training/validation sets with a ratio of 9:1.

3.2 Automatic segmentation module

The automatic segmentation module, which is inspired by the encoder-decoder network Cho et al. (2014), consists of two submodules that manage sentence encoding and space inference, respectively.

3.2.1 Character-based sentence encoding

Korean: To clarify the symbol system of the Korean language, we introduce the Korean character Hangul and its components (alphabet) Jamo. Roughly, Ja denotes the consonants and mo denotes the vowels, making up a morpho-syllabic block of Hangul as {Syllable: CV(C)}. Note that the blocks are different from morphemes, which can take any form of Jamo.

For the character vector embedding of Korean, we consider only non-decomposed syllabic blocks, for two reasons. First, the decomposition can cause ambiguity in the reconstruction process. Second, sufficiently many types of syllabic blocks are observed in the data-driven approach to cover the characters that appear in real data.

Figure 3: Visual description of the system with a sample hashtag #tiredashell.

The vector embedding of the characters was done using the skip-gram model Mikolov et al. (2013) and the subword embedding of fastText Bojanowski et al. (2016), not one-hot encoding. This is based on the observation that sparsely embedded vectors seemed inefficient for both the CNN and the bidirectional long short-term memory (BiLSTM), in the sense that they lack distributional information and even require heavy computation. Also, the property of distributional semantics was expected to compensate for the noise of user-generated text such as errata.
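A minimal sketch of training such character vectors with the fastText Python bindings is given below; the corpus file name, its one-utterance-per-line format with space-separated syllabic blocks, and the use of the train_unsupervised API of the modern bindings are assumptions on our side:

import fasttext

# char_corpus.txt is assumed to hold one utterance per line, with the syllabic
# blocks separated by spaces so that each block is treated as a 'word'
model = fasttext.train_unsupervised('char_corpus.txt', model='skipgram', dim=100)

vec = model.get_word_vector('엄')  # 100-dimensional vector for one syllabic block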

In the form of a non-segmented sequence of character vectors, the sentence is fed as an input to the CNN and the BiLSTM. They finally yield two flattened vectors as the output layer of each network, which are concatenated to make up an encoding of the whole input sentence (Figure 2). The specification is stated in Table 1.

Specification
Corpus: Instances of utterances (KOR): 2,000,000 (#char: >2,500); Instances of tweets (ENG): 3,000,000 (#char: 37)
CNN: Input size (single channel): (n, 100, 1); Filters: 32; Window: Conv layer (3, 100), Max pooling (2, 1); # Conv layers: 2
BiLSTM: Input size: (n, 100) (KOR), (n, 37) (ENG); Hidden layer nodes: 32
MLP: Hidden layer nodes: 128; # Layers: 2
Others: Optimizer: Adam (0.0005); Batch size: 128; Dropout: 0.3 (for MLPs); Activation: ReLU (MLPs, CNN, output for KOR), Sigmoid (output for ENG); Loss function: Mean squared error (KOR), Binary cross-entropy (ENG)
Table 1: System specification.

n was set to 100, considering the usual length of the utterances and hashtags. The input is the feature extracted from a non-segmented utterance, and the output is a binary vector that indicates the appropriate segmentation. Since the architecture for Korean and English differs due to linguistic nature, the utilization of CNN and BiLSTM was done differently for the two systems. Note that the activation and loss function were chosen experimentally; for Korean, using sigmoid resulted in overfitting that caused an awkward segmentation even for a simple composite. We assume this phenomenon comes from the volume of the character set. For Korean (ReLU), the threshold was set to 0.5, and for English (sigmoid), to 0.35.

English: Since the character set size of the English corpus we exploit here differs from Korean (>2,500 vs. 37), we utilized one-hot encoding for the BiLSTM summarization. However, due to the nature of the CNN, a dense representation (GloVe; https://nlp.stanford.edu/projects/glove/) was used for each alphabet letter (and possibly the numbers or the symbol #).

In brief, we adopted a similar architecture to the Korean one for context encoding. Roughly, this conveys the concept of our study as language-independent contextual segmentation that can help disambiguate challenging composites. That is, the system segments the sentence into words considering the whole composition, not only the history or the neighborhood. The specification is also stated in Table 1.
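To make the specification in Table 1 concrete, a minimal Keras sketch of the Korean configuration is given below; the exact layer wiring (in particular the kernel of the second convolution) is our assumption and is not taken from the released code:

from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten, Dense,
                          Dropout, Bidirectional, LSTM, concatenate)
from keras.models import Model
from keras.optimizers import Adam

n, dim = 100, 100   # maximum character length and embedding dimension (Korean)

# CNN branch: the sentence as an (n, dim, 1) single-channel 'image'
cnn_in = Input(shape=(n, dim, 1))
c = Conv2D(32, (3, dim), activation='relu')(cnn_in)   # window (3, 100)
c = MaxPooling2D((2, 1))(c)
c = Conv2D(32, (3, 1), activation='relu')(c)          # second conv layer (assumed kernel)
c = MaxPooling2D((2, 1))(c)
cnn_out = Flatten()(c)

# BiLSTM branch: the same sentence as a plain sequence of character vectors
rnn_in = Input(shape=(n, dim))
rnn_out = Bidirectional(LSTM(32))(rnn_in)

# concatenated encoding, decoded by a 2-layer MLP into n space/no-space scores
h = concatenate([cnn_out, rnn_out])
h = Dropout(0.3)(Dense(128, activation='relu')(h))
h = Dropout(0.3)(Dense(128, activation='relu')(h))
out = Dense(n, activation='relu')(h)   # sigmoid with binary cross-entropy for English

model = Model(inputs=[cnn_in, rnn_in], outputs=out)
model.compile(optimizer=Adam(lr=0.0005), loss='mean_squared_error')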

3.2.2 Decoder with a binary output

The backend of the segmentation module includes a decoder network that assigns proper segmentation for the given sequence of character vectors. With the vectorized encoding of the sentence as an input, the decoder network infers a binary sequence that indicates whether there should be a space just after each character. Note that the length of the output sequence equals l, where l denotes the number of characters in the input sentence. Also note that the activation and loss function differ system by system (Table 1).
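As an illustration of how the binary output is consumed, the sketch below inserts a space after every character whose predicted score exceeds the decision threshold from Section 3.2.1 (0.5 for Korean, 0.35 for English); the helper itself is our illustration:

def apply_spacing(text, scores, threshold=0.5):
    # insert a space after each character whose predicted score exceeds the threshold
    out = []
    for ch, s in zip(text, scores):
        out.append(ch)
        if s >= threshold:
            out.append(' ')
    return ''.join(out).strip()

# e.g., apply_spacing('아버지친구분당선되셨어요', scores) -> '아버지 친구분 당선 되셨어요'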

Figure 4: A brief sketch on the web-based segmentation service.

3.3 Application to web interface

The last feature of the proposed system is its applicability as an assistive interface for various web-based services such as Instagram and Telegram. Two main issues are considered.

3.3.1 Long sentences

Since the architecture deals with sentences of fixed length n, which is set to 100, the inference of spaces for long sentences requires an additional procedure. To cover such cases, we applied an overlap-hopping for sentences whose length exceeds n. The overlap length is 30 (characters), and the decision on the segmentation follows the inference that is done later. A demonstration is given in the experiment section.
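A minimal sketch of this overlap-hopping is given below, assuming a predict function that returns per-character space scores for a chunk of at most n characters; as described above, the later window's inference overrides the overlap:

def segment_long(text, predict, n=100, overlap=30):
    # infer space scores for a sentence longer than n characters by hopping
    # windows of length n with a 30-character overlap; later inference wins
    scores = [0.0] * len(text)
    start = 0
    while start < len(text):
        chunk = text[start:start + n]
        preds = predict(chunk)
        for i in range(len(chunk)):
            scores[start + i] = preds[i]   # overlapping positions get overwritten
        if start + n >= len(text):
            break
        start += n - overlap
    return scores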

3.3.2 Immediate segmentation and recommendation

The interface should be user-friendly considering the class of people the assistive technology targets. Also, since the system incorporates a deep structure, the issue of computation should not be overlooked.

Thus, we suggest a system which acts as either an immediate segmenter or a segmentation recommender, depending on the option the user selects (possibly set automatically considering the user status). Since a large portion of web-based services accompany a procedure of storing the query on the server, the non-segmented text the user generates can first be fed as an input to the automatic segmentation module running online. If the user wants an immediate segmentation, the module immediately uploads a segmented version of the sentence. If the user wants a recommendation, the segmented version is returned so that the user can modify and confirm it (Figure 4). The latter can be utilized as user feedback. In real-world usage, the spaces given by the user (not fully non-segmented, but with a few spaces) are also expected to be adopted as ground truth.
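A minimal Flask-style sketch of this flow is shown below; the endpoint name, the mode field, and the segment helper standing in for the online segmentation module are all our assumptions, not an implemented service:

from flask import Flask, request, jsonify

app = Flask(__name__)

def segment(text):
    # placeholder for the automatic segmentation module running online
    raise NotImplementedError

@app.route('/segment', methods=['POST'])
def handle_query():
    data = request.get_json()
    segmented = segment(data['text'])        # non-segmented user input
    if data.get('mode') == 'immediate':
        return jsonify({'post': segmented})  # upload the segmented sentence directly
    # recommendation mode: return the suggestion so the user can modify and confirm it;
    # the confirmed version can later be fed back as ground truth
    return jsonify({'suggestion': segmented})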

This user-friendly function is closely related to the motivation of this study. It is not concretely implemented here but is expected to be easily materialized in the user interface.

4 Experiment

4.1 Implementation

Implementation of the system was done with fastText (https://pypi.org/project/fasttext/) (for Korean) and Keras Chollet et al. (2015), which were used for character vector embedding and for building the neural network models, respectively. With a GTX Tesla M40 24GB, the training for both systems took about 30 hours; about two hours per epoch.

4.2 Evaluation

For the evaluation of Korean, two widely used text processing tools were adopted: the correction module by Naver (goo.gl/mTMLmr) and KoSpacing Jeon (2018) (demo and model architecture are available at https://github.com/haven-jeon/PyKoSpacing). Naver does not disclose the architecture that is utilized, but it seems to be based on a hybrid structure of rules and statistical models using a Korean word taxonomy, considering the system malfunctions given inputs with long and non-standard expressions (e.g., dialect and slang). KoSpacing is based on a deep hierarchical structure of CNN and BiGRU but adopts character n-grams and various news articles. We treat the two modules as representatives of word-based and character-based toolkits, respectively.

For English, a widely used segmentation toolkit, DataStories (https://github.com/cbaziotis/ekphrasis) Baziotis et al. (2017), was introduced. It is based on the Viterbi algorithm Forney (1973); Segaran and Hammerbacher (2009), and is known to adopt a Twitter dataset of size 3.3M for training.

4.2.1 Quantitative evaluation

Korean - A subjective measure: Since the proposed system does not aim at strictly correct segmentation, we modified the concept of mean opinion score (MOS) to fit data that can have multiple answers. Provided with three sentences as the output of each system, namely Naver, KoSpacing, and Proposed, 14 Korean subjects were asked to assign ranks considering the naturalness of the segmentation, from the viewpoint of a native speaker. Assigning the same rank was allowed (for example, (1,1,3) denotes two 1st places and one 3rd place; similarly, (1,2,2) denotes two 2nd places).

For the survey, 30 posts were chosen from the Instagram posts of Makrae Park (https://www.instagram.com/korea_grandma; Figure 1). The posts are licensed for this use: it is described in the guideline that the data uploaded by the users belong to the company and are provided for an academic purpose (refer to https://help.instagram.com/155833707900388). The test posts are not necessarily single utterances; they contain compound sentences which incorporate scrambling, dialects, non-normalized expressions, and many errata. Since the word-based module did not respond to many of such samples, we chose the posts for which the module generated a segmented output. For the cases where the characters of the output are corrected due to the nature of the module, the result was modified so that only the spaces are assigned to the original text. Also, taking into account the output from KoSpacing and Proposed, we chose the cases where the outputs of all three systems differ from each other (the survey sheet is available at https://github.com/warnikchow/raws/blob/master/survey.xlsx).

The ranking R of each system equals the average of its ranks over the whole set of utterances. The system rank was then normalized as (N + 1 - R) / N for the number of systems N = 3, which is designed to yield 1 for an all-first-place model and 1/N for the opposite case. Scores of 0.6809, 0.7142, and 0.7444 were obtained for Naver, KoSpacing, and Proposed, respectively. The result does not imply the superiority of the proposed system over the conventional toolkits regarding accuracy; rather, it conveys that the proposed system yields quite tolerable segmentation for utterances that are non-normalized and non-standard, mainly due to the characteristics of the training corpus.
English - An objective measure: Different from Korean, which is agglutinative and syllable-timed, English is isolating and thus incorporates a strict boundary regarding word segmentation. This means there exists a relatively apparent answer for the non-segmented utterances. Thus, for a test set, we manually excerpted 302 hashtags from Van Hee et al. (2018), which contains about 4,500 tweets. To check the segmentation performance on challenging hashtags, we chose only the chunks that consist of more than two words, all lowercase (except the capital in the first letter) or all capital letters, possibly containing numerals.

The evaluation metric is the F1-score, which is the harmonic mean of precision and recall. For an input with l characters, the systems predict l instances regarding the appropriateness of the segmentation. With the proposed system, we obtained 0.9336 and 0.9446 for F1-score and recall, respectively. This score falls slightly short of the score we obtained on the same test set with DataStories (F1-score: 0.9642, recall: 0.9528).
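For reference, with precision P and recall R computed over those per-position decisions, the score is F1 = 2PR / (P + R).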

However, our model differs in that no word statistics are used; only the dense character vectors, which are real-valued counterparts of the one-hot encoded vectors, are utilized. Also, our structure is more robust to unseen data and typing errors, since the omission of a character does not seriously affect the overall prediction. For example, with an out-of-data hashtag such as '#mamorizatcambrige', the baseline yields no segmentation for the unseen data, while the proposed model yields 'mamoriz at cambrige', which seems proper. Although the accuracy of segmentation does not outperform the conventional module, the proposed system has the potential to be applied to segmenting real-time chats or analyzing noisy user-generated text, boosted by the flexibility of input length (Section 3.3.1), which covers the instances that cause malfunction for the earlier statistics-based models.

4.2.2 Qualitative analysis

Korean: We introduce two real-life utterances that display the advantages of using the proposed system. For the sentences, input was given without segmentation.
(1) 나너본지한세달다돼감
na ne pon ci han sey tal ta tway kam (informal)
It’s been almost three months since we last met.
- Naver: 나 / 너 / 본 / 지 / 한 / 세 / 달 / 다 / 돼 / 감
- Proposed: 나 / 너 / 본지 / 한 / 세달 / 다 / 돼 / 감
For (1), Naver showed a result that strictly obeys the segmentation rules, which is theoretically correct but provides little guidance on the naturalness of pronunciation. In contrast, Proposed yields an output in which the characters comprising the phrases are agglutinated, making the inter-syllable durations more natural.
(2) 아버지친구분당선되셨어요
a pe ci chin kwu pwun tang sen toy syess e yo
Father’s friend was elected. (honorific)
- KoSpacing: 아버지 / 친구 / 분당선 / 되셨어요
- Proposed: 아버지 / 친구분 / 당선 / 되셨어요
Sentence (2) incorporates an ambiguity: if the space goes between '구 (kwu)' and '분 (pwun)' instead of between '분' and '당 (tang)', the sentence meaning changes to "Father's friend became Bundang line (a subway line of Korea).", which is syntactically acceptable but semantically awkward. This happens since the honorific pwun is omissible, and partially because KoSpacing is trained on news articles, which are more familiar with topics regarding subway lines than colloquial expressions.
English: Qualitative analysis for English is more straightforward since the boundary between correct and wrong is clear; it is thus done with some sample hashtags in the test set (the segmentation results are denoted in parentheses). The system generally performed properly with simple colloquial tags such as #imscared (i m scared), #ilookreallygood (i look really good), and #ithinkhehatesmesomedays (i think he hates me some days). Also, the proposed system works well with tags containing numbers, such as #throwdown1035 (throw down 1035), or abbreviated terms, such as #elenawearswyou (elena wears w you; w denotes with).

However, we observed wrong results with confusing hashtags such as #senioritisisreal (?senior it is is real; originally senioritis is real). This kind of malfunction is assumed to originate in the lack of contextual information and could be mitigated by an advanced network that takes the whole tweet as an input.

5 Conclusion

In this paper, we suggested a cross-lingually applicable word segmentation system that can be utilized both as a web-based assistive technology and as an analysis toolkit for noisy user-generated text. As a first step, we refined Korean drama scripts and an English tweet dataset, making up 2M and 3M lines respectively, in a way that the segmentation results fit the context and (for Korean) the prosody. Based on these, deep training architectures were constructed to yield the automatic segmenters, whose performance was verified by quantitative and qualitative comparison with widely used text processing modules. Especially for Korean, where the strict rules do not necessarily need to be obeyed in real-time written conversation, we proposed a subjective measure that can verify the relative naturalness of the segmentation that the system assigns to user-generated texts.

The application of this system is not limited to assistive technology for the digitally marginalized and hashtag segmentation for the analysis of noisy texts, as suggested in the paper. For instance, for Korean, since the system receives a non-segmented input and generates a segmented one that fits the prosody, automatic speech recognition (ASR) or text-to-speech (TTS) modules of smart devices may utilize the proposed function as post- or pre-processing, respectively, to create a more natural output. Since the reliability of the system concept was verified by adopting languages with both a large (Korean) and a small (English) character set, it is expected to apply smoothly to other languages that incorporate an adequate number of characters. The pretrained systems for Korean and English are available online (https://github.com/warnikchow/raws).

Acknowledgments

This work was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot system) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea). Also, the authors appreciate ASAP of Hanyang University, Makrae Park, and Ye Seul Jung for providing the corpus, motivation, and useful example sentences.

References

  • Ando and Lee (2003) Rie Kubota Ando and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of japanese kanji sequences. Natural Language Engineering, 9(2):127–149.
  • Bansal et al. (2015) Piyush Bansal, Romil Bansal, and Vasudeva Varma. 2015. Towards deep semantic analysis of hashtags. In European Conference on Information Retrieval, pages 453–464. Springer.
  • Baziotis et al. (2017) Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. Datastories at semeval-2017 task 4: Deep lstm with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754.
  • Berardi et al. (2011) Giacomo Berardi, Andrea Esuli, Diego Marcheggiani, and Fabrizio Sebastiani. 2011. Isti@ trec microblog track 2011: Exploring the use of hashtag segmentation and text quality ranking. In TREC.
  • Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.
  • Cai and Zhao (2016) Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for chinese. arXiv preprint arXiv:1606.04300.
  • Chang (2010) Hsia-Ching Chang. 2010. A new perspective on twitter hashtag use: Diffusion of innovation theory. Proceedings of the Association for Information Science and Technology, 47(1):1–4.
  • Chen et al. (2015a) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1744–1753.
  • Chen et al. (2015b) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long short-term memory neural networks for chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206.
  • Cheng et al. (2010) Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010. You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 759–768. ACM.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Cho et al. (2018) Won Ik Cho, Woo Hyun Kang, and Nam Soo Kim. 2018. Hashcount at semeval-2018 task 3: Concatenative featurization of tweet and hashtags for irony detection. In Proceedings of the 12th International Workshop on Semantic Evaluation, SemEval-2018, New Orleans, LA, USA. Association for Computational Linguistics.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
  • Forney (1973) G David Forney. 1973. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.
  • Garg et al. (2010) Naresh Kumar Garg, Lakhwinder Kaur, and MK Jindal. 2010. Segmentation of handwritten hindi text. International Journal of Computer Applications (IJCA), 1(4):22–26.
  • Jeon (2018) Heewon Jeon. 2018. Kospacing: Automatic korean word spacing. https://github.com/haven-jeon/KoSpacing.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Lee et al. (2014) Changki Lee, Edward Choi, and Hyunki Kim. 2014. Balanced korean word spacing with structural svm. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 875–879.
  • Lee and Kim (2013) Changki Lee and Hyunki Kim. 2013. Automatic korean word spacing using pegasos algorithm. Information Processing & Management, 49(1):370–379.
  • Lee et al. (2002) Do-Gil Lee, Sang-Zoo Lee, Hae-Chang Rim, and Heui-Seok Lim. 2002. Automatic word spacing using hidden markov model for refining korean text corpora. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization-Volume 12, pages 1–7. Association for Computational Linguistics.
  • Lee (2011) Jae-Sung Lee. 2011. Three-step probabilistic model for korean morphological analysis. Journal of KIISE: Software and Applications, 38(5):257–268.
  • Lee et al. (2003) Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, and Hany Hassan. 2003. Language model based arabic word segmentation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 399–406. Association for Computational Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Sassano (2002) Manabu Sassano. 2002. An empirical study of active learning with support vector machines for japanese word segmentation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 505–512. Association for Computational Linguistics.
  • Segaran and Hammerbacher (2009) Toby Segaran and Jeff Hammerbacher. 2009. Beautiful data: the stories behind elegant data solutions. O'Reilly Media, Inc.
  • Van Hee et al. (2018) Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. Semeval-2018 task 3: Irony detection in english tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, SemEval-2018, New Orleans, LA, USA. Association for Computational Linguistics.