Paraphrasing is often performed with less concern for controlled style conversion. Especially for questions and commands, style-variant paraphrasing can be crucial to tone and manner, which also matters in industrial applications such as dialog systems. In this paper, we address this issue with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, we expand the corpus to formal and informal sentences by human rewriting and transferring. We verify the validity and industrial applicability of our approach by checking that classification and inference performance are adequate under fine-tuning approaches, while also proposing a supervised formality transfer task.
Paraphrasing, the act of using different sentences with the same meaning Bhagat and Hovy (2013), is strongly related to text style conversion or transfer Yamshchikov et al. (2020). While prior studies often modify sentiment or offensiveness Logeswaran et al. (2018); dos Santos et al. (2018), from the viewpoint of paraphrasing it must be checked whether the core content of the sentence is maintained during conversion. If the sentence meaning stays the same while politeness or formality changes Rao and Tetreault (2018), we can call it paraphrasing or rewriting. Such styles can be represented in diverse ways across genre, domain, and language Jhamtani et al. (2017); Fu et al. (2018); Yang et al. (2019).
We present a scheme for constructing a corpus of style-variant paraphrases for directive sentences such as questions and commands, targeting the Korean language, where politeness (suffixes) and honorifics play a significant role in conversation Strauss and Eun (2005). Here, we consider topic and speech act as attributes constituting the directive sentence Cho et al. (2020) and construct a formal-style paraphrase set using natural language queries that express each topic and speech act. Finally, style-variant paraphrase pairs are obtained by manual conversion to informal sentences with attention to content preservation, and they are to be released publicly as the first open text style transfer dataset in Korean. Our contributions are the following:
We present a corpus construction scheme capable of performing multiple tasks while enabling parallel sentence style transfer.
We release a Korean corpus where sentence formality style is well defined, covering questions and commands in daily use (https://github.com/cynthia/stylekqc).
In general, sentence style is handled in terms of tone and manner in writing, though the two differ subtly Brooks (2020). (In this paper, we view ‘formality’ in Korean as a style, and use ‘conversion’ and ‘transfer’ interchangeably.) However, previous research on content-preserving style transfer Logeswaran et al. (2018); Tian et al. (2018) does not seem to concern tone alone, in that a change in sentiment may influence the core speaker intent. Furthermore, most approaches take the perspective of unsupervised learning dos Santos et al. (2018); Bao et al. (2019), leaving parallel style-variant corpora for supervised learning less explored, although such corpora might provide robust guidance for current generative pre-trained models Radford et al. (2019).
This trend is similarly seen in previous studies on Korean. Since the early approaches follow studies in English and other languages, sentiment- or stance-based style transfer has been prevalent Lee et al. (2019); Choi and Na (2019). (Most of this work is not in an internationally readable format; thus, we note here the methods used in the papers.) In Hong et al. (2018), transfer of the politeness suffix of sentence enders was considered while maintaining the sentence meaning, mainly regarding the ‘hay-yo’ and ‘hap-syo’ enders, which differ in degree of formality. However, it dealt only with syntactic change, not with modification of the lexicon, adverbs, or the tone and manner of speech, which are all considered influential in the honorific system Strauss and Eun (2005). In this regard, we consider that formality style transfer should be well defined along with content preservation. Furthermore, there is no open dataset for Korean style transfer that can be used for research and commercial purposes. We aim to resolve the above issues by proposing a straightforward and effective construction scheme.
We construct a corpus of Korean directives, namely questions and commands, where a question is either an alternative question (Alt. Q) or a wh-question (Wh-Q), and a command is either a prohibition (PH) or a requirement (REQ), following Cho et al. (2020). In other words, we target four types of speech acts and assume sentences that can be uttered to humans or artificial intelligence (AI) agents. Six topics are involved: messenger, calendar, weather and news, smart home, shopping, and entertainment, which come from a recent survey on customer usage Lee et al. (2020). Twelve workers from different backgrounds were recruited. We asked each worker to specify two likes and one dislike among the topics, and these preferences were taken into account when forming six subgroups of two people each.
We designed a construction scheme with the following three steps, which allow its reliability to be checked while generating 5,000 utterances per topic and 7,500 per speech act.
Writing natural language queries
Rewriting paraphrased queries in formal tone
Converting the formal sentences to informal
First, query generation is a process in which participants directly propose the core content of the directives that are to be rewritten in a formal style. In this process, participants were asked to write natural language queries for each of the two given speech acts on the assigned topic. (These queries were given by the process managers in Cho et al. (2020), but here we let the workers create them to make the contents more diverse and to benefit from their preferences.) Since the query structure differs by speech act type as in Cho et al. (2020), the created queries did not overlap across the workers. The queries were checked for suitability, to exclude personally identifiable information and content that could cause social harm. A total of 125 queries were generated for each (topic, act) pair. Example queries for some (topic, act) pairs are shown below. All the queries were generated in Korean but are presented here in English for demonstration.
(Shopping, Alt. Q) The one that has better A/S between Samsung and Apple
(Entertainment, Wh-Q) The TV channel number where the news is on at 8:00 p.m.
(Messenger, PH) Not to turn on WeChat automatic update
(Smart home, REQ) To recharge the wireless vacuum cleaner in the multi-room
No particular style was required in generation, but the workers were asked to write expressions that fit colloquial contexts and daily life. Also, workers were asked to modify knowledge-intensive questions and queries with multiple contents.
Next is a process in which the workers in each subgroup exchange the queries they generated and rewrite them into formal-style sentences. (In this process, the workers check the validity of each other's queries, so that any incompleteness the moderator may have missed can be pointed out.) We asked for the formal style first because Korean has more diverse expressions for formal utterances in terms of indirect speech and honorifics Byon (2006), which makes paraphrasing easier than with informal utterances, whose variants might not come to a worker's mind in the first place. The utterances were required to suit conversation with senior or elderly addressees rather than with friends or juniors.
Softening the commands to requests
Mentioning the addressee’s responsibility
Lessening the addressee’s burden
Asking the availability of the addressee
Some of these characteristics are shared across cultures Brown et al. (1987). They may also appear similarly in East Asian societies Gu (1990) and in languages with similar syntax such as Japanese Okamoto (1999); Fukada and Asato (2004). However, we faced language-specific considerations regarding functional and lexical expressions and asked the workers to reflect them in the construction. At the same time, to keep the sentences natural in colloquial contexts, written-style or outdated phrases and words were avoided.
The final process is converting the directive sentences written in formal style into informal sentences. Here, each worker converts the formal sentences that their partner wrote from the worker's own original query, checking for typos and misunderstandings once again. ‘Informality’ as defined here differs slightly from being rude or impolite; instead, it means that the conversation moves towards a more comfortable and personal relationship Rao and Tetreault (2018).
In this process, we asked the workers to maintain the overall sentence structure, whose diversity was already secured by the policies for writing formal sentences. With this, we could prevent potential overlap between the converted sentences and also guarantee ‘parallel’ data. This can be more effective in Korean, where indirectness is often distinguished from formality; for instance, a cautious request to a younger brother can be informal but indirect.
Style conversion was performed in various aspects such as changes in sentence enders, honorifics, and lexicon (such as nation to country). The workers were encouraged to insert or delete phrases where naturalness required it, and to perform at least two word-level modifications. The detailed guideline (https://docs.google.com/document/d/1gjyEMCcp0mxmdzSKdd5OrLFVikyq22OsxXHisSr2THY) for the whole process was provided to the workers with example query-sentence tuples, and we exhibit one of them (Figure 1).
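As one hypothetical way to screen converted pairs against the "at least two word-level modifications" guideline, the sketch below counts whitespace-token differences between a formal sentence and its informal counterpart using Python's difflib. The source does not describe any automatic check; the function names, the whitespace tokenization, and the English toy sentences are assumptions for illustration, and Korean text would likely need morpheme-aware handling.

```python
import difflib


def count_word_level_changes(formal: str, informal: str) -> int:
    """Count whitespace tokens that were replaced, inserted, or deleted."""
    a, b = formal.split(), informal.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced span counts as the longer of its two sides.
            changes += max(i2 - i1, j2 - j1)
    return changes


def meets_guideline(formal: str, informal: str, minimum: int = 2) -> bool:
    """Hypothetical screen: True when at least `minimum` tokens changed."""
    return count_word_level_changes(formal, informal) >= minimum


# English stand-ins for a formal/informal pair (toy data, not from the corpus).
formal = "could you please tell me the weather"
informal = "tell me the weather"
print(count_word_level_changes(formal, informal))
```

A pair that falls below the threshold would only be flagged for human review in this sketch, since a single token can carry several style changes in an agglutinative language.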
The corpus was refined by three native speakers with corpus construction experience for Korean directive sentences. In this process, typos, awkward sentences, and paraphrases that are not sufficiently diverse were inspected, and the reviews were reflected by the moderators.
Through experiments, we show that the proposed construction scheme yields a corpus that simultaneously supports multiple tasks, which is an advantage from a practical viewpoint.
Speech act classification
Sentence style transfer
For each of the 24 (topic, act) chunks, which contain 125 queries each, we set aside 80% (100 queries) for training, 4% (5 queries) for validation, and 16% (20 queries) for testing. Of the 30,000 sentences in the whole dataset, the training set contains 24,000, and the dev and test sets contain 1,200 and 4,800, respectively. The queries were chosen randomly, and all sets have the same proportions of topics and speech acts.
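The query-level split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the label strings, the seed, and the function name are hypothetical, and the figure of 10 sentences per query is derived from the reported totals (30,000 sentences over 24 chunks of 125 queries).

```python
import random
from collections import Counter

# Hypothetical label strings for the six topics and four speech acts.
TOPICS = ["messenger", "calendar", "weather_news", "smart_home", "shopping", "entertainment"]
ACTS = ["alt_q", "wh_q", "ph", "req"]
QUERIES_PER_CHUNK = 125
SENTENCES_PER_QUERY = 10  # derived: 30,000 sentences / (24 chunks * 125 queries)
SPLIT_SIZES = (("train", 100), ("dev", 5), ("test", 20))


def split_queries(seed: int = 0) -> dict:
    """Map every (topic, act, query_id) to 'train', 'dev', or 'test'."""
    rng = random.Random(seed)
    assignment = {}
    for topic in TOPICS:
        for act in ACTS:
            ids = list(range(QUERIES_PER_CHUNK))
            rng.shuffle(ids)  # random choice of queries within the chunk
            start = 0
            for part, size in SPLIT_SIZES:
                for qid in ids[start:start + size]:
                    assignment[(topic, act, qid)] = part
                start += size
    return assignment


assignment = split_queries()

# Sentence totals per split, and per (split, topic) to confirm the balance.
sentences_per_split = Counter()
sentences_per_split_topic = Counter()
for (topic, act, qid), part in assignment.items():
    sentences_per_split[part] += SENTENCES_PER_QUERY
    sentences_per_split_topic[(part, topic)] += SENTENCES_PER_QUERY

for part, _ in SPLIT_SIZES:
    print(part, sentences_per_split[part])  # train 24000, dev 1200, test 4800
```

Splitting by query rather than by sentence keeps all paraphrases of one query in the same set, which is what makes the train and test queries disjoint.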
Topic (TOPIC) and speech act (ACT) classification are intuitively formulated. There are 5,000 utterances for each topic and 7,500 utterances for each speech act, with the six topics and four speech acts serving as the labels.
Paraphrase detection (PARA) requires a sentence pair. In Cho et al. (2020), sentence similarity was defined on five levels by checking whether the topic or speech act overlaps between the two input sentences, with the highest similarity assigned when the queries are identical (the paraphrases). The paraphrase detection task was derived from this by reducing the multi-class problem to a binary task.
Finally, we checked whether sentence style transfer (STYLE) works using the pairs in the corpus: 12,000 pairs for training, 600 for validation, and 2,400 for testing. Training was done in the direction of converting formal sentences to informal ones.
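The binary formulation can be illustrated with a small sketch in which a pair is positive only when both sentences were written from the same query. The record layout, the query identifiers, and the all-pairs enumeration are assumptions made for illustration; the source does not state how pairs, and negative pairs in particular, were sampled.

```python
from itertools import combinations

# Toy records; texts and query ids are hypothetical placeholders.
sentences = [
    {"text": "formal paraphrase 1 of query A", "query_id": "shopping/alt_q/001"},
    {"text": "formal paraphrase 2 of query A", "query_id": "shopping/alt_q/001"},
    {"text": "formal paraphrase 1 of query B", "query_id": "shopping/alt_q/002"},
    {"text": "formal paraphrase 1 of query C", "query_id": "messenger/ph/001"},
]


def paraphrase_label(a: dict, b: dict) -> int:
    """Return 1 if both sentences come from the same query, otherwise 0."""
    return int(a["query_id"] == b["query_id"])


# Enumerate every unordered pair and attach its binary label.
pairs = [
    (a["text"], b["text"], paraphrase_label(a, b))
    for a, b in combinations(sentences, 2)
]
positives = sum(label for _, _, label in pairs)
print(len(pairs), positives)  # 6 pairs, 1 of them positive
```

Pairs that share only a topic or only a speech act, which received intermediate similarity levels in the five-level scheme, all collapse to the negative class here.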
Both the sentence classification and paraphrase detection tasks were implemented with KcBERT (https://github.com/Beomi/KcBERT) Lee (2020), a BERT-based Devlin et al. (2019) model, and for sentence style transfer, KoGPT2 (https://github.com/SKT-AI/KoGPT2), which is based on GPT-2 Radford et al. (2019), was adopted. F1 (macro) and accuracy were used for the classification tasks, and for style transfer, we checked character edit distance (CED). The accuracy for style transfer denotes the precision obtained with the model trained on the training set Pang (2019). Experimental settings are provided as supplementary material.
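For reference, the two kinds of scores named above can be sketched in plain Python. This is an illustrative implementation under assumptions, not the evaluation code used in the paper: character edit distance is taken to be Levenshtein distance over Unicode code points (so a precomposed Hangul syllable counts as one character), and macro F1 is the unweighted mean of per-class F1 scores.

```python
def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, one code point per unit."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def macro_f1(gold: list, pred: list) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


print(char_edit_distance("kitten", "sitting"))  # 3
```

Whether the distance should instead be computed over decomposed jamo is a design choice the source does not specify; the code point version is simply the most direct reading of "character".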
In classification and inference, the evaluation results are consistent between the train and test datasets (Table 1). Considering that the queries in each set are distinct from each other, we claim that our dataset extends to broader real-world problems, given also its comprehensive coverage of topics and acts that are of interest in everyday conversation and smart speaker dialogues. Though the baseline score is quite high for ACT and PARA, this does not undermine one of our goals, which is to provide a solid corpus construction scheme with practical, real-world applicability.
On STYLE, we adopted CED since formality as our ‘style’ concerns changes in suffixes and some lexical items more than changes in overall word order and phrase usage. (As for other objective measures, morpheme-level tokenization is not yet unified for Korean sentences, which makes evaluation harder.) Nonetheless, we found the transfer task still challenging in terms of the objective measure. Instead, we observed its practical validity using a style classifier trained on the train and validation sets. We qualitatively checked that seq2seq Sutskever et al. (2014) with a pre-trained generative model produces the intended style transfer; the details are provided in Appendix A.
We have some notes on the validity of the created dataset. Primarily, although the dataset is the first open corpus proposed for Korean style transfer, the granularity of the style difference within each pair is not provided here as in Rao and Tetreault (2018). Also, since our dataset provides style transfer that maintains the overall sentence structure, some sentence pairs show only minor differences, which is sufficient for spoken language processing but less robust for digitized online texts. Finally, since the formality conversion concerns morpho-syntactic and lexical changes rather than the paraphrasing done in writing the formal sentences, the diversity of expression regarding style is limited to sentence formats that are not awkward to utter.
Despite the limitations, we want to emphasize that our approach offers a reliable and efficient scheme for service providers or task managers aiming at a particular style transfer for various types of sentences. For instance, if one replaces the queries with structured query language (SQL) or canonical forms of statements and uses ‘rudeness’ or ‘twitter-likeness’ as a style, the parallel dataset can be created in the same way, though with a slightly different guideline. This kind of pair generation was done with rules or back-translation in Rao and Tetreault (2018), but we believe that human-aided construction is more reliable and that the resulting shortage of data can be covered with pre-trained models for spoken language.
In this paper, we construct and disclose the first style-variant Korean paraphrase corpus. Topic, speech act, and paraphrase are simultaneously considered in evaluating the final corpus, where the consistent composition is assumed to be guaranteed by the evaluation results. The entire guideline is currently specific to formality transfer in Korean, but it can be used to build other parallel style transfer corpora with an extended pool of topics, speech acts, queries, and styles. All the resources are available online (https://github.com/cynthia/stylekqc).
In the corpus construction procedure, which was based on the documented approval of the workers, adequate compensation was paid to each of them throughout query generation, writing formal sentences, and converting them to informal ones. The participants, recruited from social media and the web, were familiar with smart speakers, and some of them had experience in corpus construction. Each of the 12 participants received 250 WON ($0.22) for writing each query and 200 WON ($0.18) for writing each sentence. Thus, each participant was paid 600,000 WON ($540) to write 250 queries and 2,500 sentences.
Our resource is free from license issues since all the materials were created according to the guideline (a kind of template) and checked during post-processing. The outcome of our project contains neither personally identifiable information nor content that can induce social harm.
This research was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot system) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea). Also, the corpus construction was possible thanks to the help of twelve passionate participants, namely Kyung Seo Ki, Dongho Lee, Yoon Kyung Lee, Hee Young Park, Yulhee Kim, Seyoung Park, Jiwon An, Jeonghwa Cho, Kihyo Park, Kyuhwan Lee, Soomin Lee, and Minhwa Chung.
Delete and generate: Korean style transfer based on deleting and generating word n-grams. In Annual Conference on Human and Language Technology, pages 400–403. Human and Language Technology.
Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems, pages 5103–5113.
Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2019, page 6155. NIH Public Access.