Temporal expressions in natural language texts stand as one of the crucial pieces of information to be extracted from these texts. Accordingly, several text analysis applications, like event extractors [Ritter et al.2012] and text-based video annotation systems [Küçük and Yazıcı2011], include a temporal expression extractor as a submodule to identify, normalize and then make use of these expressions.
Traditionally, some temporal expressions, namely date and time expressions, have been considered named entities and included in the scope of named entity recognition (NER) systems. For instance, in the Message Understanding Conference (MUC) series [Grishman and Sundheim1996], which was conducted for several years to promote research in information extraction, some date and time expressions were considered within the scope of the NER task and, in the related guidelines, these expressions were recommended to be annotated with the TIMEX tag. Within the scope of MUC, however, only the identification of these temporal expressions was required, without the need for their normalization.
TimeML is a standard markup language for annotating temporal expressions and events [Pustejovsky et al.2003a], built upon previous work on the annotation of temporal expressions such as [Ferro et al.2001, Setzer2001]. According to the current TimeML guideline [Saurí et al.2005], the TIMEX3 tag is used to annotate the identified temporal expressions, and the normalized forms of the expressions are also specified within the annotations. Additionally, the SIGNAL tag is used to annotate the temporal relations between two temporal expressions, two events, or a temporal expression and an event. There are four main types of temporal expressions within the scope of TimeML: date, time, set, and duration [Saurí et al.2005]. Hence, in addition to introducing normalization, TimeML considers a broader range of temporal expressions than the MUC series.
There are several temporal expression extraction and normalization systems, as reported in studies like [UzZaman et al.2013]. One of the earliest such systems is GUTime, the temporal expression recognition and normalization module of a larger system called TARSQI, which annotates temporal expressions, relations, and events in news texts [Verhagen et al.2005]. Several of the systems proposed so far, including Edinburgh-LTG [Grover et al.2010], HeidelTime [Strötgen and Gertz2010], SUTime [Chang and Manning2013], and FSS-TimEx [Zavarella and Tanev2013], are rule-based, and some of them, such as HeidelTime, have been extended to extract temporal expressions in other languages including Arabic, Italian, Spanish, and Vietnamese [Strötgen et al.2014]. As previously pointed out, the extraction of some temporal expressions has long been considered a subtask of named entity recognition, and accordingly, some of the aforementioned systems like Edinburgh-LTG and SUTime are based on previous NER systems.
In addition to the system proposals, as there is a need for corpora annotated with temporal expressions, relations, and events, resources like TimeBank [Pustejovsky et al.2003b] have emerged. TimeBank has been commonly used to evaluate and compare different system proposals. Similar annotated resources have also been constructed for other languages, such as French TimeBank [Bittar et al.2011], Spanish TimeBank [Saurí and Badia2012], and Italian TimeBank [Caselli et al.2011]. Such resources are indispensable both for training extraction systems and for evaluating and comparing different system proposals on common ground. To the best of our knowledge, no such annotated resource exists for Turkish.
Considering the related tools for Turkish, extraction of date and time expressions has been performed by the rule-based NER system of [Küçük and Yazıcı2009] and its extended versions like [Küçük and Yazıcı2012], mostly following the named entity definition of the MUC series and extracting some deictic date expressions as well, without normalization. These experiments have been performed on diverse text genres such as news articles, historical texts, and child stories. Within a text-based semantic video annotation system that makes use of this NER system, a separate date normalization module has been implemented to normalize only the deictic date expressions, using the creation dates of the corresponding videos as reference dates [Küçük and Yazıcı2011]. In the course of this latter study, extraction experiments were performed on automatically obtained news video texts, which are mostly noisy (due to the character recognition errors introduced during the sliding text recognition procedure employed). More recently, date and time expressions have also been recognized in informal Turkish texts (i.e., tweets) using the aforementioned rule-based system, as described in [Küçük and Steinberger2014]. Another related work is presented in [Şeker and Diri2010], where the authors consider temporal logic and event times in Turkish based on existing temporal models, yet it does not aim to propose a temporal expression extractor or a related resource for Turkish.
In this paper, we provide an analysis of the temporal expressions in Turkish, following the corresponding TimeML classification. We mainly provide several wide-coverage patterns for the extraction of these expressions together with sample expressions and their annotated forms with the TIMEX3 tag. With the presented lexicon, pattern bases, and the review of the related limited literature on Turkish, we believe that this paper can be used as a guideline before building a temporal expression extraction and normalization system for Turkish. The rest of the paper is organized as follows: In Section 2, a compact temporal lexicon in Turkish and patterns for the extraction of temporal expressions in Turkish are presented together with several samples. Section 3 lists the open issues on temporal expression extraction and normalization in Turkish texts and Section 4 concludes the paper.
2 Temporal Expressions in Turkish
Before presenting the lexical resources and patterns for temporal expressions in Turkish, we briefly summarize their two particularities in formal Turkish texts which should also be considered during system development. These writing rules are provided below, following the corresponding language rules published by Türk Dil Kurumu (‘Turkish Language Association’) [TDK2015]:
The tokens within temporal expressions are all in lowercase, except the names of the months and week days which have their initial letters capitalized. Sample expressions are bugün (‘today’), yarından sonraki gün (‘the day after tomorrow’), Pazartesi sabahı (‘Monday morning’), Mayıs ayının ikinci Pazar günü (‘the second Sunday of (the month of) May’).
The suffixes attached to the ends of the tokens of temporal expressions are not separated from those tokens. The names of the months and week days and numerals are the exceptions to this rule, as the sequence of suffixes added to them is separated with an apostrophe. In the illustrative temporal expression 2015 yılının Mart’ının 23’ü (‘the 23rd of March of (the year of) 2015’), the sequence of suffixes attached to yıl (‘year’) is not separated from it, while the ones attached to the numeral (23) and to Mart (‘March’) are separated with apostrophes.
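The apostrophe convention described above can be exploited during tokenization. The following is a minimal sketch, assuming a hypothetical helper named split_suffix that is not part of any existing system; it simply cuts a token at its first apostrophe and accepts both the typographic and the ASCII apostrophe.

```python
import re
from typing import Tuple

# Both the typographic (’) and the ASCII (') apostrophe are accepted.
_APOSTROPHE = re.compile(r"[’']")


def split_suffix(token: str) -> Tuple[str, str]:
    """Split a token at its first apostrophe into (stem, suffix sequence).

    Tokens without an apostrophe are returned unchanged with an empty
    suffix string; separating their suffixes would need morphological
    analysis, which this sketch does not attempt.
    """
    parts = _APOSTROPHE.split(token, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return token, ""


if __name__ == "__main__":
    for tok in "2015 yılının Mart’ının 23’ü".split():
        print(tok, split_suffix(tok))
```

For the sample expression, only Mart’ının and 23’ü are split; yılının is left intact because its suffixes are not marked orthographically.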
As mentioned in Section 1, there are four distinct types of temporal expressions within the scope of TimeML: date, time, set, and duration, which correspond to the value range of the type attribute of the TIMEX3 tag. In this section, we first provide a compact temporal lexicon for Turkish and in the following subsections, we present patterns for temporal expressions in Turkish and then samples conforming to these patterns.
We should note that neither the lexicon nor the pattern bases are exhaustive. We have tried to devise patterns with as high coverage as possible, yet they are all open to modifications, corrections, and extensions, especially when building practical systems for Turkish. Normalization is also not considered within the current study; a distinct set of normalization rules should be devised for the extracted temporal expressions as part of future work.
2.1 Turkish Lexicon for Temporal Expressions
We have built the Turkish lexicon for temporal expressions with the following lexical classes. The class identifiers are given in parentheses and they are used in the ultimate extraction patterns as the building blocks.
The list of cardinal numerals from 1 to 2100, both in numbers and in words (<NUM>), and the list of the corresponding ordinal numbers (<ORD>).
The names of days (<DAY>), that of months (<MON>), that of seasons (<SEAS>).
The names of the parts of a day, like sabah (‘morning’), akşam (‘evening’) etc. (<D-PART>).
The names of the units of time, like saat (‘hour’), gün (‘day’) etc. (<T-UNIT>).
The modifiers of temporal expressions, like gelecek (‘next’), geçen (‘last’) etc. (<MOD>).
Deictic temporal expressions like şimdi (‘now’), dün (‘yesterday’) etc. (<DEIC>).
The determiners like her (‘every’) (<DET>).
The quantifiers like kere (‘times’ as in three times a day) (<QUANT>).
The suffixes that can be attached at the ends of temporal expressions like the case (including genitive and possessive) markers, plural markers, and relativizers in Turkish (a single such suffix is denoted as <SUF>).
The apostrophe character (<APST>).
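As an illustration of how such a lexicon might be represented in software, the sketch below stores a small, deliberately incomplete sample of each class in a Python dictionary. The dictionary layout and the function name lexical_class are assumptions made for this example; the <SUF> and <APST> classes are left out because they are not whole-token classes.

```python
from typing import Optional

# A partial, illustrative sample of the lexical classes; not the full lexicon.
LEXICON = {
    "NUM": {"bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz", "on"},
    "ORD": {"birinci", "ikinci", "üçüncü"},
    "DAY": {"Pazartesi", "Salı", "Çarşamba", "Perşembe", "Cuma", "Cumartesi", "Pazar"},
    "MON": {"Ocak", "Şubat", "Mart", "Nisan", "Mayıs", "Haziran",
            "Temmuz", "Ağustos", "Eylül", "Ekim", "Kasım", "Aralık"},
    "SEAS": {"ilkbahar", "yaz", "sonbahar", "kış"},
    "D-PART": {"sabah", "öğle", "akşam", "gece"},
    "T-UNIT": {"dakika", "saat", "gün", "hafta", "ay", "yıl", "sene"},
    "MOD": {"gelecek", "geçen"},
    "DEIC": {"şimdi", "dün", "bugün", "yarın"},
    "DET": {"her"},
    "QUANT": {"kere", "kez"},
}


def lexical_class(token: str) -> Optional[str]:
    """Return the class identifier of a token, or None if it is unknown.

    Matching is case-sensitive on purpose: month and day names are
    capitalized in formal Turkish, the other entries are lowercase.
    """
    if token.isascii() and token.isdigit():
        # Numerals written in digits are limited to the range 1..2100.
        return "NUM" if 1 <= int(token) <= 2100 else None
    for name, entries in LEXICON.items():
        if token in entries:
            return name
    return None


if __name__ == "__main__":
    print([lexical_class(t) for t in "geçen Mart 23 sabah".split()])
```

Exact, case-sensitive lookup avoids the well-known pitfalls of lowercasing the Turkish dotted and dotless i, at the cost of missing sentence-initial capitalized tokens.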
2.2 Date Expressions
Before presenting the actual patterns for date expressions (<DATE-EXPR>), we first present patterns for the auxiliary constructs <DAY-EXPR>, <MON-EXPR>, and <YEAR-EXPR>, which are in turn used within the <DATE-EXPR> patterns. The patterns are presented as regular expressions where ? denotes zero or one, * denotes zero or more, | denotes the OR operator, and parentheses are for grouping purposes. The patterns may include both the classes of lexical entries, described in the previous section, and, rarely, individual entries themselves, like yıl (‘year’), sene (‘year’), ay (‘month’), gün (‘day’), and saat (‘hour’).
Though not denoted in the patterns, there are also constraints, regarding the lexical entries, that should be enforced during the utilization of the patterns. For instance, the <NUM> values within the <DAY-EXPR> should be within the range of [1..31] while the <NUM> values within the <MON-EXPR> should be within the range of [1..12].
|<DAY-EXPR> (<NUM><APST> | (<ORD> | <DAY>) gün)<SUF>*|
|<MON-EXPR> (<MON><APST> | (<ORD> | <MON>) ay)<SUF>*|
|<YEAR-EXPR> <NUM> ((yıl | sene)<SUF>*)?|
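The auxiliary patterns and the range constraint on day numbers can be sketched with Python regular expressions as follows. Several simplifications are assumptions of this example rather than part of the patterns: <SUF>* is approximated by any run of word characters, <NUM> covers digits only, <ORD> holds just three sample entries, and <MON-EXPR> is read in parallel with <DAY-EXPR> (month name plus apostrophe, or ordinal/month name followed by ay).

```python
import re

NUM = r"\d{1,4}"                      # digits only; number words omitted
APST = r"[’']"                        # typographic or ASCII apostrophe
SUF = r"\w*"                          # crude stand-in for <SUF>* (assumption)
ORD = r"(?:birinci|ikinci|üçüncü)"    # tiny sample of <ORD>
DAY = r"(?:Pazartesi|Salı|Çarşamba|Perşembe|Cumartesi|Cuma|Pazar)"
MON = (r"(?:Ocak|Şubat|Mart|Nisan|Mayıs|Haziran|"
       r"Temmuz|Ağustos|Eylül|Ekim|Kasım|Aralık)")

# <DAY-EXPR>: (<NUM><APST> | (<ORD> | <DAY>) gün)<SUF>*
DAY_EXPR = re.compile(rf"(?:(?P<num>{NUM}){APST}|(?:{ORD}|{DAY}) gün){SUF}")
# <MON-EXPR>, read in parallel with <DAY-EXPR>
MON_EXPR = re.compile(rf"(?:{MON}{APST}|(?:{ORD}|{MON}) ay){SUF}")


def is_day_expr(text: str) -> bool:
    """Match <DAY-EXPR> and enforce the [1..31] constraint on day numbers."""
    m = DAY_EXPR.fullmatch(text)
    if m is None:
        return False
    num = m.group("num")
    return num is None or 1 <= int(num) <= 31


def is_mon_expr(text: str) -> bool:
    """Match <MON-EXPR> as a whole string."""
    return MON_EXPR.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["23’ü", "45’i", "ikinci günü", "Mart’ının", "Mart ayının"]:
        print(sample, is_day_expr(sample), is_mon_expr(sample))
```

Keeping the numeric range check outside the regular expression mirrors the remark above that such constraints are enforced when the patterns are used, not encoded in them.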
Some wide-coverage patterns for extracting date expressions in Turkish are provided below.
|<DATE-EXPR> (<NUM>.<NUM>.<NUM> | <NUM>/<NUM>/<NUM>)|
|<DATE-EXPR> <NUM>? <MON> <NUM>? <DAY>?|
|<DATE-EXPR> <YEAR-EXPR> <MON-EXPR>? <DAY-EXPR>?|
|<DATE-EXPR> <YEAR-EXPR> <NUM> (<MON><SUF>* | <MON> <DAY>?)|
|<DATE-EXPR> <MON-EXPR> <DAY-EXPR>?|
|<DATE-EXPR> <MOD>? (<T-UNIT> | <DAY> | <MON> | <SEAS>)|
Sample date instances conforming to some of these patterns are given in Table 1. In this table and the other tables in the current paper, the first column shows the Turkish samples, the second column shows their meanings in English, the third column shows the TIMEX3 annotation of the sample, and the fourth column shows the number of the pattern that the sample conforms to. For the sample in the second to last row of Table 1, the normalized value is given with respect to a reference date in the year 2015.
|Date Expression||Meaning||TIMEX3 Annotation||Pattern|
|23.03.2015||23.03.2015||<TIMEX3 tid="t1" type="DATE" value="2015-03-23">23.03.2015</TIMEX3>||(1)|
|23 Mart 2015||March 23, 2015||<TIMEX3 tid="t1" type="DATE" value="2015-03-23">23 Mart 2015</TIMEX3>||(2)|
|23 Mart 2015 Pazartesi||March 23, 2015 Monday||<TIMEX3 tid="t1" type="DATE" value="2015-03-23">23 Mart 2015 Pazartesi</TIMEX3>||(2)|
|2015 yılının Mart’ının 23’ü||the 23rd of the March of the year 2015||<TIMEX3 tid="t1" type="DATE" value="2015-03-23">2015 yılının Mart’ının 23’ü</TIMEX3>||(3)|
|2015 yılı 23 Mart’ı||the 23rd of the March of the year 2015||<TIMEX3 tid="t1" type="DATE" value="2015-03-23">2015 yılı 23 Mart’ı</TIMEX3>||(4)|
|Mart ayının ikinci günü||the second of March||<TIMEX3 tid="t1" type="DATE" value="XXXX-03-02">Mart ayının ikinci günü</TIMEX3>||(5)|
|geçen sonbahar||last autumn||<TIMEX3 tid="t1" type="DATE" value="2014-FA">geçen sonbahar</TIMEX3>||(6)|
|şimdi||now||<TIMEX3 tid="t1" type="DATE" value="PRESENT_REF">şimdi</TIMEX3>||(7)|
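To make the link between a pattern and its TIMEX3 annotation concrete, the sketch below handles only the first two date patterns (numeric dates, and day, month name, year, optional weekday). The function name annotate_date, the day.month.year reading of numeric dates, and the use of XXXX for a missing year are assumptions of this example; normalization proper is outside the scope of the patterns.

```python
import re
from typing import Optional

MONTHS = {"Ocak": 1, "Şubat": 2, "Mart": 3, "Nisan": 4, "Mayıs": 5,
          "Haziran": 6, "Temmuz": 7, "Ağustos": 8, "Eylül": 9,
          "Ekim": 10, "Kasım": 11, "Aralık": 12}
MON = "|".join(MONTHS)
DAYS = "Pazartesi|Salı|Çarşamba|Perşembe|Cumartesi|Cuma|Pazar"

# Pattern (1): the same separator must be used twice (backreference \2).
NUMERIC_DATE = re.compile(r"(\d{1,2})([./])(\d{1,2})\2(\d{4})")
# Pattern (2): <NUM>? <MON> <NUM>? <DAY>?
MONTH_DATE = re.compile(
    rf"(?:(\d{{1,2}}) )?({MON})(?: (\d{{4}}))?(?: (?:{DAYS}))?")


def annotate_date(text: str, tid: str = "t1") -> Optional[str]:
    """Wrap a date string in a TIMEX3 tag, or return None if unmatched."""
    m = NUMERIC_DATE.fullmatch(text)
    if m:
        day, _, month, year = m.groups()   # assumed day.month.year order
        if not (1 <= int(month) <= 12 and 1 <= int(day) <= 31):
            return None
        value = f"{year}-{int(month):02d}-{int(day):02d}"
    else:
        m = MONTH_DATE.fullmatch(text)
        if m is None:
            return None
        day, month, year = m.groups()
        if day and not 1 <= int(day) <= 31:
            return None
        value = f"{year or 'XXXX'}-{MONTHS[month]:02d}"
        if day:
            value += f"-{int(day):02d}"
    return f'<TIMEX3 tid="{tid}" type="DATE" value="{value}">{text}</TIMEX3>'


if __name__ == "__main__":
    for sample in ["23.03.2015", "23 Mart 2015", "23 Mart 2015 Pazartesi"]:
        print(annotate_date(sample))
```

The weekday, when present, is kept inside the tagged extent but does not affect the value attribute, in line with the third row of Table 1.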
2.3 Time Expressions
The patterns for the common time expressions in Turkish are listed below, and samples conforming to these patterns are provided in Table 2. As the final pattern denotes, some time patterns also make use of the extracted date expressions and can be recursive.
|<TIME-EXPR> <D-PART>? saat? (<NUM>.<NUM> | <NUM>:<NUM>)|
|<TIME-EXPR> <D-PART>? saat <NUM>|
|<TIME-EXPR> <DAY>? <D-PART> saat<SUF>*|
|<TIME-EXPR> <DAY>? <D-PART><SUF>*|
|<TIME-EXPR> <DATE-EXPR> <TIME-EXPR>|
|Time Expression||Meaning||TIMEX3 Annotation||Pattern|
|11.30||11.30||<TIMEX3 tid="t1" type="TIME" value="T11:30">11.30</TIMEX3>||(8)|
|sabah saat dokuz||nine o’clock in the morning||<TIMEX3 tid="t1" type="TIME" value="T09:00">sabah saat dokuz</TIMEX3>||(9)|
|sabah saatleri||morning hours||<TIMEX3 tid="t1" type="TIME" value="TMO">sabah saatleri</TIMEX3>||(10)|
|Pazartesi sabahı||Monday morning||<TIMEX3 tid="t1" type="TIME" value="XXXX-WXX-1TMO">Pazartesi sabahı</TIMEX3>||(11)|
|2 Mayıs saat 14:00||14:00 on May 2||<TIMEX3 tid="t1" type="TIME" value="XXXX-05-02T14:00">2 Mayıs saat 14:00</TIMEX3>||(12)|
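The first two time patterns can be sketched in the same way. In this example the function name time_value and the small table of hour words are assumptions; the part-of-day word is matched but not used to disambiguate between morning and evening hours, which a real normalizer would have to do.

```python
import re
from typing import Optional

D_PART = r"(?:sabah|öğle|akşam|gece)"
HOUR_WORDS = {"bir": 1, "iki": 2, "üç": 3, "dört": 4, "beş": 5, "altı": 6,
              "yedi": 7, "sekiz": 8, "dokuz": 9, "on": 10,
              "on bir": 11, "on iki": 12}
# Longer alternatives first, so that "on iki" is tried before "on".
WORD_ALT = "|".join(sorted(HOUR_WORDS, key=len, reverse=True))

# Pattern (8): <D-PART>? saat? (<NUM>.<NUM> | <NUM>:<NUM>)
CLOCK = re.compile(rf"(?:{D_PART} )?(?:saat )?(\d{{1,2}})[.:](\d{{2}})")
# Pattern (9): <D-PART>? saat <NUM>
HOUR = re.compile(rf"(?:{D_PART} )?saat (\d{{1,2}}|{WORD_ALT})")


def time_value(text: str) -> Optional[str]:
    """Return a TIMEX3-style time value such as T09:00, or None."""
    m = CLOCK.fullmatch(text)
    if m:
        hour, minute = int(m.group(1)), int(m.group(2))
    else:
        m = HOUR.fullmatch(text)
        if m is None:
            return None
        token = m.group(1)
        hour = int(token) if token.isdigit() else HOUR_WORDS[token]
        minute = 0
    if hour > 23 or minute > 59:
        return None   # range constraint on hours and minutes
    return f"T{hour:02d}:{minute:02d}"


if __name__ == "__main__":
    for sample in ["11.30", "sabah saat dokuz", "saat 14:00"]:
        print(sample, time_value(sample))
```

Note that a string like 11.30 is ambiguous between a time and a decimal number; in practice the surrounding context (for example a preceding saat) would be needed to decide.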
2.4 Set Expressions
Common patterns for the extraction of set expressions are provided below, and sample set expressions conforming to these patterns are listed in Table 3.
|<SET-EXPR> <DET> (<T-UNIT> | <DAY> | <MON> | <SEAS>)|
|<SET-EXPR> <T-UNIT><SUF> <NUM> <QUANT>?|
|<SET-EXPR> <DET>? <NUM>? <T-UNIT><SUF> <NUM> <QUANT>?|
|Set Expression||Meaning||TIMEX3 Annotation||Pattern|
|her ay||every month||<TIMEX3 tid="t1" type="SET" value="P1M" quant="EVERY">her ay</TIMEX3>||(13)|
|her Pazartesi||every Monday||<TIMEX3 tid="t1" type="SET" value="XXXX-WXX-1" quant="EVERY">her Pazartesi</TIMEX3>||(13)|
|haftada iki kez||twice a week||<TIMEX3 tid="t1" type="SET" value="P1W" freq="2X">haftada iki kez</TIMEX3>||(14)|
|her iki günde bir||once every two days||<TIMEX3 tid="t1" type="SET" value="P2D" quant="EVERY">her iki günde bir</TIMEX3>||(15)|
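A minimal sketch of how the attributes of a set expression could be derived is given below for two cases only: her followed by a time unit, and a time unit in the locative case followed by a number word and a quantifier. The function name set_attributes, the restriction to four time units, and the treatment of the locative suffix as a fixed -da/-de ending are simplifying assumptions of this example.

```python
import re
from typing import Dict, Optional

UNITS = {"gün": "D", "hafta": "W", "ay": "M", "yıl": "Y"}
NUMS = {"bir": 1, "iki": 2, "üç": 3, "dört": 4, "beş": 5}
UNIT_ALT = "|".join(UNITS)
NUM_ALT = "|".join(NUMS)

# <DET> <T-UNIT>, with her as the only determiner considered here
EVERY = re.compile(rf"her ({UNIT_ALT})")
# <T-UNIT><SUF> <NUM> <QUANT>, with the locative -da/-de as the suffix
FREQ = re.compile(rf"({UNIT_ALT})d[ae] ({NUM_ALT}) (?:kez|kere)")


def set_attributes(text: str) -> Optional[Dict[str, str]]:
    """Return TIMEX3 attributes for a set expression, or None."""
    m = EVERY.fullmatch(text)
    if m:
        return {"type": "SET", "value": f"P1{UNITS[m.group(1)]}",
                "quant": "EVERY"}
    m = FREQ.fullmatch(text)
    if m:
        return {"type": "SET", "value": f"P1{UNITS[m.group(1)]}",
                "freq": f"{NUMS[m.group(2)]}X"}
    return None


if __name__ == "__main__":
    for sample in ["her ay", "haftada iki kez"]:
        print(sample, set_attributes(sample))
```

Expressions such as her Pazartesi are not covered by this sketch, since their value depends on a weekday rather than on a period.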
2.5 Duration Expressions
The two patterns for the extraction of duration expressions in Turkish are given below and three related samples are provided in Table 4.
|<DURATION-EXPR> <NUM> <T-UNIT>|
|Duration Expression||Meaning||TIMEX3 Annotation||Pattern|
|iki gün||two days||<TIMEX3 tid="t1" type="DURATION" value="P2D">iki gün</TIMEX3>||(16)|
|sekiz hafta||eight weeks||<TIMEX3 tid="t1" type="DURATION" value="P8W">sekiz hafta</TIMEX3>||(16)|
|yıllar||years||<TIMEX3 tid="t1" type="DURATION" value="PXY">yıllar</TIMEX3>||(17)|
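For the <NUM> <T-UNIT> duration pattern, the value attribute follows the ISO 8601 duration notation, in which hours and minutes are introduced by a T designator. The sketch below assumes a hypothetical function duration_value and a small sample of number words; vague durations such as yıllar are intentionally not handled.

```python
import re
from typing import Optional

DATE_UNITS = {"gün": "D", "hafta": "W", "ay": "M", "yıl": "Y", "sene": "Y"}
TIME_UNITS = {"saat": "H", "dakika": "M"}
NUM_WORDS = {"bir": 1, "iki": 2, "üç": 3, "dört": 4, "beş": 5, "altı": 6,
             "yedi": 7, "sekiz": 8, "dokuz": 9, "on": 10}

UNIT_ALT = "|".join(list(DATE_UNITS) + list(TIME_UNITS))
NUM_ALT = "|".join(NUM_WORDS)

# <DURATION-EXPR>: <NUM> <T-UNIT>
DURATION = re.compile(rf"(\d+|{NUM_ALT}) ({UNIT_ALT})")


def duration_value(text: str) -> Optional[str]:
    """Return an ISO 8601 duration such as P2D or PT3H, or None."""
    m = DURATION.fullmatch(text)
    if m is None:
        return None
    token, unit = m.groups()
    amount = int(token) if token.isdigit() else NUM_WORDS[token]
    if unit in TIME_UNITS:
        # Hours and minutes go after the T designator.
        return f"PT{amount}{TIME_UNITS[unit]}"
    return f"P{amount}{DATE_UNITS[unit]}"


if __name__ == "__main__":
    for sample in ["iki gün", "sekiz hafta", "3 saat"]:
        print(sample, duration_value(sample))
```

Separating date units from time units keeps the month designator (P2M) distinct from the minute designator (PT2M), which share the same letter.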
3 Open Issues
The open issues on temporal expression extraction from Turkish texts include the following:
The development of temporal expression extraction and normalization systems is an important open issue for Turkish. A convenient system can be achieved by (i) building a rule-based/learning system from scratch, or by (ii) extending an already existing and open-source temporal expression extractor, like HeidelTime [Strötgen and Gertz2010] or SUTime [Chang and Manning2013], to Turkish, or by (iii) extending an already existing Turkish NER system recognizing date and time expressions, like [Küçük and Yazıcı2009], to make it a full-fledged temporal expression extractor. Deeper examinations of these tools are definitely necessary to assess the feasibility of each option, yet, the second and the third options currently seem less labor-intensive compared to the first one.
Due to the agglutinative nature of Turkish, the tokens within the temporal expressions can have sequences of suffixes attached, as demonstrated in the proposed patterns given in the previous section. So, a convenient morphological analyzer should be considered for inclusion into the prospective systems.
In order to train and test the prospective temporal expression extraction and normalization proposals for Turkish, conveniently annotated corpora in Turkish are necessary. To the best of our knowledge, no such resource, in other words, no Turkish Timebank exists currently. Actually, this lack of annotated corpora is an issue even for the more commonly studied problem of NER on Turkish texts. The only study that describes a publicly-available Turkish corpus (of tweets) annotated with the MUC-style basic named entity types (person, location, and organization names, money and percentage expressions, along with date and time expressions) is presented in [Küçük et al.2014]. This annotated resource can be used as a starting point to build a Turkish Timebank, though it should be noted that no normalization information exists for the annotated date and time expressions in the current form of the resource.
After the developments to be carried out within the course of the previous two items above, temporal signals (to be annotated with the SIGNAL tag) and events can be included within the scopes of the system proposals to fully comply with the TimeML specifications. Thereby, a full-fledged temporal expression and event extraction system can be achieved for Turkish.
4 Conclusion
Temporal expression extraction is an important information extraction task and the corresponding extraction tools make significant contributions to larger natural language processing tasks. In this paper, we present a TimeML-based analysis of temporal expressions in Turkish as related studies on Turkish texts are quite rare. We first describe a temporal lexicon and then use the classes in the lexicon as the building blocks to devise a total of 17 wide-coverage patterns for the extraction of date, time, set, and duration expressions in Turkish. We also provide samples of temporal expressions in Turkish along with the related open issues.
References
- [Bittar et al.2011] André Bittar, Pascal Amsili, Pascal Denis, and Laurence Danlos. 2011. French TimeBank: An ISO-TimeML Annotated Reference Corpus. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2, pages 130–134.
- [Caselli et al.2011] Tommaso Caselli, Valentina Bartalesi Lenzi, Rachele Sprugnoli, Emanuele Pianta, and Irina Prodanof. 2011. Annotating Events, Temporal Expressions and Relations in Italian: The It-TimeML Experience for the Ita-TimeBank. In Proceedings of the 5th Linguistic Annotation Workshop, pages 143–151.
- [Chang and Manning2013] Angel X Chang and Christopher D Manning. 2013. SUTIME: Evaluation in TempEval-3. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 78–82.
- [Şeker and Diri2010] Sadi Evren Şeker and Banu Diri. 2010. TimeML and Turkish Temporal Logic. In International Conference on Artificial Intelligence, pages 881–887.
- [Ferro et al.2001] Lisa Ferro, Inderjeet Mani, Beth Sundheim, and George Wilson. 2001. TIDES Temporal Annotation Guidelines - Version 1.0.2. Technical report, The MITRE Corporation.
- [Grishman and Sundheim1996] Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A Brief History. In Proceedings 16th International Conference on Computational Linguistics, pages 466–471.
- [Grover et al.2010] Claire Grover, Richard Tobin, Beatrice Alex, and Kate Byrne. 2010. Edinburgh-LTG: TempEval-2 System Description. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 333–336.
- [Küçük and Steinberger2014] Dilek Küçük and Ralf Steinberger. 2014. Experiments to Improve Named Entity Recognition on Turkish Tweets. In Proceedings of the EACL Workshop on Language Analysis for Social Media, pages 71–78.
- [Küçük and Yazıcı2009] Dilek Küçük and Adnan Yazıcı. 2009. Named Entity Recognition Experiments on Turkish Texts. In T. Andreasen et al., editor, Proceedings of the International Conference on Flexible Query Answering Systems, volume 5822 of Lecture Notes in Computer Science, pages 524–535.
- [Küçük and Yazıcı2011] Dilek Küçük and Adnan Yazıcı. 2011. Exploiting Information Extraction Techniques for Automatic Semantic Video Indexing with an Application to Turkish News Videos. Knowledge-Based Systems, 24(6):844–857.
- [Küçük and Yazıcı2012] Dilek Küçük and Adnan Yazıcı. 2012. A Hybrid Named Entity Recognizer for Turkish. Expert Systems with Applications, 39(3):2733–2742.
- [Küçük et al.2014] Dilek Küçük, Guillaume Jacquet, and Ralf Steinberger. 2014. Named Entity Recognition on Turkish Tweets. In Proceedings of the Language Resources and Evaluation Conference, pages 450–454.
- [Pustejovsky et al.2003a] James Pustejovsky, José Castano, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir Radev. 2003a. TimeML: Robust Specification of Event and Temporal Expressions in Text. In Proceedings of the AAAI Spring Symposium on New Directions in Question-Answering, volume 3, pages 28–34.
- [Pustejovsky et al.2003b] James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003b. The TIMEBANK Corpus. In T. McEnery, editor, Corpus Linguistics, pages 647–656.
- [Ritter et al.2012] Alan Ritter, Mausam, Oren Etzioni, and Sam Clark. 2012. Open Domain Event Extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104–1112.
- [Saurí and Badia2012] Roser Saurí and Toni Badia. 2012. Spanish TimeBank 1.0. Technical report, Linguistic Data Consortium (LDC).
- [Saurí et al.2005] Roser Saurí, Jessica Littman, Bob Knippen, Robert Gaizauskas, Andrea Setzer, and James Pustejovsky. 2005. TimeML Annotation Guidelines. http://www.timeml.org.
- [Setzer2001] Andrea Setzer. 2001. Temporal Information in Newswire Articles: An Annotation Scheme and Corpus Study. Ph.D. thesis, University of Sheffield, UK.
- [Strötgen and Gertz2010] Jannik Strötgen and Michael Gertz. 2010. HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 321–324.
- [Strötgen et al.2014] Jannik Strötgen, Ayser Armiti, Tran Van Canh, Julian Zell, and Michael Gertz. 2014. Time for more languages: Temporal tagging of Arabic, Italian, Spanish, and Vietnamese. ACM Transactions on Asian Language Information Processing, 13(1):1.
- [TDK2015] TDK. 2015. Türkçe Yazım Kılavuzu - Noktalama İşaretleri. available at http://tdk.gov.tr/index.php?option=com_content&view=article&id=187:Noktalama-Isaretleri-Aciklamalar&catid=50:yazm-kurallar&Itemid=132/. Last date accessed: 30-August-2015.
- [UzZaman et al.2013] Naushad UzZaman, Hector Llorens, Leon Derczynski, Marc Verhagen, James Allen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 1–9.
- [Verhagen et al.2005] Marc Verhagen, Inderjeet Mani, Roser Saurí, Robert Knippen, Seok Bae Jang, Jessica Littman, Anna Rumshisky, John Phillips, and James Pustejovsky. 2005. Automating Temporal Annotation with TARSQI. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 81–84.
- [Zavarella and Tanev2013] Vanni Zavarella and Hristo Tanev. 2013. FSS-TimEx for TempEval-3: Extracting Temporal Information from Text. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 58–63.