How to define co-occurrence in different domains of study?

by Mathieu Roche, et al.

This position paper presents a comparative study of co-occurrences. Some similarities and differences in the definition exist depending on the research domain (e.g. linguistics, NLP, computer science). This paper discusses these points, and deals with the methodological aspects in order to identify co-occurrences in a multidisciplinary paradigm.




1 Introduction

Determining co-occurrences in corpora is challenging for different applications such as classification, translation, terminology building, etc. More generally, co-occurrences can be identified with all types of data, e.g. databases [CAO2007343], texts [DBLP:conf/iis/RocheAMK04], images [Verma:2015:LEC:2798736.2798879], music [Ghosal2011], video [Jeon2005], etc.

The co-occurrence concept has different definitions depending on the research domain (e.g. linguistics, NLP, computer science, biology). This position paper reviews the main definitions in the literature and discusses similarities and differences according to the domains. This type of study can be crucial in the context of data science, which is geared towards developing a multidisciplinary paradigm for data processing and analysis, especially textual data.

Here the co-occurrence concept related to textual data is discussed. Note that before their validation by an expert, co-occurrences of words are often considered as candidate terms.

First, Section 2 of this paper details the different definitions of co-occurrence according to the studied domains. Section 3 discusses and compares these different aspects based on their intrinsic definition but also on the associated methodologies in order to identify them. Finally, Section 4 lists some perspectives.

2 Co-occurrence in a multidisciplinary context

2.1 Linguistic viewpoint

In linguistics, two notions are widely used to define a term: the lexical unit [Lederer69] and the polylexical expression [Gross96]. The latter represents a set of words having an autonomous existence, and is also called a multi-word expression [Sag:2002:MEP:647344.724004].

In addition, several linguistics studies use the collocation notion. [Clas94] gives two properties defining a collocation. First, collocation is defined as a group of words having an overall meaning that is deducible from the units (words). For example, climate change is considered as a collocation because the overall meaning of this group of words can be deduced from both words climate and change. On the other hand, the expression to rain cats and dogs is not a collocation because its meaning cannot be deduced from each of the words; this is called a fixed expression or an idiom.

A second property is added by [Clas94] to define a collocation. The meaning of the words that make up the collocation must be limited. For example, buy a dog is not a collocation because the meaning of buy is not limited.

2.2 NLP viewpoint

In the natural language processing (NLP) domain, the co-occurrence notion refers to the general phenomenon where words are present together in the same context. More precisely, several principles are used that take contextual criteria into account.

First, the terms or phrases [Bourigault:1992:SGA:992383.992415, Daille:1994:TAE:991886.991975] can respect syntactic patterns (e.g. adjective noun, noun noun, noun preposition noun, etc.). Some examples of extracted phrases (i.e. syntactic co-occurrences) are given in Table 1.

In addition, methods without linguistic filtering are also conventionally used in the NLP domain by extracting n-grams of words (i.e. lexical co-occurrences) [DBLP:journals/nle/MassungZ16, e46325429cf245b58edc6f687b3aac1e]. n-grams are contiguous sequences of n words extracted from a given sequence of text (e.g. the bi-grams, i.e. n-grams with n = 2, $w_1 w_2$ and $w_2 w_3$ are associated with the text $w_1 w_2 w_3$). n-grams that allow gaps are called skip-n-grams (e.g. the skip-bi-grams $w_1 w_2$, $w_1 w_3$, $w_2 w_3$ are related to the text $w_1 w_2 w_3$). The skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships [Mikolov:2013:DRW:2999792.2999959]. Some examples of n-grams and skip-n-grams are given in Table 1.

After summarizing the term notion in the NLP domain, the following section discusses these aspects in the computer science context, particularly in data mining. Note that the NLP domain may be considered as being located at the linguistics and computer science interface.

Sentence (input): With climate change the water cycle is expected to undergo significant change.

| Technique | Candidates (output) |
|---|---|
| Syntactic patterns (noun noun, adjective-noun) | climate change, water cycle, significant change |
| Bi-grams of words | With climate, climate change, change the, the water, water cycle, cycle is, is expected, expected to, to undergo, undergo significant, significant change |
| 2-skip-bi-grams | With climate, With change, With the, climate change, climate the, climate water, change the, change water, change cycle, the water, the cycle, the is, water cycle, water is, water expected, cycle is, cycle expected, cycle to, is expected, is to, is undergo, expected to, expected undergo, expected significant, to undergo, to significant, to change, undergo significant, undergo change, significant change |

Table 1: Examples of candidates extracted with different NLP techniques.
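The bi-gram and 2-skip-bi-gram rows of Table 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `tokenize` and `skip_bigrams` are illustrative, and the tokenizer is deliberately naive (whitespace split plus punctuation stripping), which is an assumption and not a method described in the paper.

```python
import string

def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation (naive tokenizer)."""
    tokens = (tok.strip(string.punctuation) for tok in sentence.split())
    return [tok for tok in tokens if tok]

def skip_bigrams(tokens, max_skip=0):
    """Return ordered word pairs separated by at most `max_skip` intervening words.

    max_skip=0 yields plain bi-grams; max_skip=2 yields 2-skip-bi-grams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand word may sit 1 .. max_skip + 1 positions after the left one.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

sentence = "With climate change the water cycle is expected to undergo significant change."
tokens = tokenize(sentence)
bigrams = skip_bigrams(tokens, max_skip=0)
two_skip = skip_bigrams(tokens, max_skip=2)

print(len(bigrams), "bi-grams;", len(two_skip), "2-skip-bi-grams")
print(" ".join(two_skip[4]))  # one of the noisy candidates, e.g. "climate the"
```

No linguistic filter is applied here, which is why noisy pairs such as "climate the" appear next to relevant ones such as "climate change".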

2.3 Computer science viewpoint

In the data mining domain, co-occurring items are called association rules [Agrawal:1994:FAM:645920.672836, Yin2011] and they could be candidates for construction or enrichment of terminologies [DBLP:conf/otm/Di-JorioBFLT08].

In the data mining context, the list of items corresponds to the set of available articles. With textual data, items may represent the words present in sentences, paragraphs, or documents [Amir2005, DBLP:conf/f-egc/RabatelLPSSRL08]. A transaction is a set of items. A set of transactions is a learning set used to determine association rules.

Some extensions of association rules are called sequential patterns. They take into account a certain order of extracted elements [Jaillet:2006:SPT:1165444.1165446, Serp08] with an enriched representation related to textual data as follows:

  • objects represent texts or pieces of texts,

  • items are the words of a text,

  • itemsets represent sets of words present together within a sentence, paragraph or document,

  • dates highlight the order of sentences within a text.

There are several algorithms for discovering association rules and sequential patterns. One of the most popular is Apriori [Agrawal:1994:FAM:645920.672836], which is used to extract frequent itemsets from large databases. It proceeds level by level: frequent k-itemsets are used to generate candidate (k+1)-itemsets.
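The level-wise principle of Apriori can be sketched as follows. This is a simplified illustration and not the optimized algorithm of the cited paper: the function name `apriori`, the absolute `min_support` threshold and the toy transactions (sets of words standing for sentences) are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})  # frequent 1-itemsets
    frequent = dict(current)
    k = 1
    while current:
        # Join step: merge frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions: each one is the set of words of a sentence.
sentences = [
    {"climate", "change", "water"},
    {"climate", "change", "cycle"},
    {"water", "cycle"},
    {"climate", "change", "water", "cycle"},
]
frequent = apriori(sentences, min_support=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With this toy data, the only frequent itemset of size two is {climate, change}, so no candidate of size three is generated and the loop stops.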

Association rules and sequential patterns of words are often used in text mining for different applications, e.g. terminology enrichment [DBLP:conf/otm/Di-JorioBFLT08], association of concept instances [BERRAHOU2017115, DBLP:conf/f-egc/RabatelLPSSRL08], classification [Jaillet:2006:SPT:1165444.1165446, Serp08], etc.

3 Discussion: comparative study of definitions and approaches

This section compares (i) co-occurrence definitions (see Section 3.1) and (ii) automatic methods for identifying them (see Section 3.2). It highlights some similarities and differences between domains.

3.1 Co-occurrence extraction

The general definition of co-occurrence is close to that of association rules in the data mining domain. Note that the integration of windows (e.g. Association Rule with Time-Windows (ARTW) [Yin2011]) in the association rule or sequential pattern extraction process yields a similarity with skip-n-gram extraction.

The integration of syntactic criteria makes it possible to extract more relevant candidate terms (see Table 1). Such information is typically taken into account in NLP to extract terms from general or specialized domains [JiangDTCX12, Lossio-Ventura:2016:BTE:2890328.2890347, Nenadic:2003:TMB:952532.952553, Roche2017valocarn].

Table 1 highlights relevant terms extracted using linguistic patterns (e.g. climate change, water cycle, significant change). The use of linguistic patterns tends to improve precision. Other methods such as skip-bi-grams generally return lower precision, i.e. many extracted candidates are irrelevant (e.g. climate the). However, this kind of method enables extraction of some relevant terms not found with linguistic patterns (e.g. cycle expected), so recall can be improved.

Table 2 presents research domains related to different types of candidates, i.e. collocations, polylexical expressions, phrases, n-grams, association rules, sequential patterns.

Table 3 summarizes the main criteria described in the literature. Note that the extraction is more flexible and automatic when there are fewer criteria. In this table, two types of marks are associated with the different criteria. The first designates characteristics given by the co-occurrence definitions. The second represents characteristics that are implemented in many extensions of the state-of-the-art.

| Definitions | Domains |
|---|---|
| Collocations | L |
| Polylexical expressions | L + NLP |
| Phrases | NLP |
| n-grams | NLP + CS |
| Association rules | CS |
| Sequential patterns | CS |

Table 2: Summary of the main domains associated with expressions (L: linguistics, NLP: natural language processing, CS: computer science).

[Table 3: columns are ordered sequences, sequences with gaps, morpho-syntactic information and semantic information; recoverable rows are polylexical expressions, association rules and sequential patterns (cell marks not recoverable).]

Table 3: Summary of the main criteria associated with co-occurrence identification. One mark indicates that the criterion is respected by definition; the other indicates that extensions are currently used in the state-of-the-art.

Table 3 shows that the semantic criterion is seldom associated with co-occurrence definitions. This criterion is however taken into account in linguistics, where semantic aspects are considered in several studies [Heid98, Laurens99, Melcuk-etal84-99]. In this context, [Melcuk-etal84-99] introduced lexical functions that rely on semantic criteria to define the relationships between collocation units. For instance, a given relation can be expressed in various ways between the arguments and their values, like Centr (the center, culmination of), which returns different meanings:

  • Centr(crisis) = the peak

  • Centr(desert) = the heart

  • Centr(forest) = the thick

  • Centr(glory) = summit

  • Centr(life) = prime

In the data mining domain, semantic information is used in two main directions. The first one involves filtering the results if they respect certain semantic information (e.g. phrases or patterns where a word is an instance of a semantic resource). Other methods involve semantic resources in the knowledge discovery process, i.e. the extraction is driven by semantic information [BERRAHOU2017115].

In recent studies in the NLP domain, the semantic aspects are based on word embedding, which provides a dense representation of words and their relative meanings [Ganguly:2015:WEB:2766462.2767780, Zamani:2017:RWE:3077136.3080831].

Finally, note that several types of co-occurrence are often used in different domains. For example, polylexical expressions are commonly used in NLP and also in linguistics. In addition, n-grams are currently used in both the NLP and computer science domains. For example, n-grams of words are often used to build terminologies (NLP domain) but also as features for machine learning algorithms (computer science domain).


Table 4 summarizes the main types of criteria (i.e. statistic, morpho-syntactic, and semantic) used for extracting co-occurrences according to the research domains considered in this paper.

[Table 4: columns are statistic information, morpho-syntactic information and semantic information; the recoverable row is data mining (cell marks not recoverable).]

Table 4: Summary of the main criteria associated with research domains. One mark indicates that the criterion is respected for extracting co-occurrences from textual data; the other indicates that extensions are currently used in the state-of-the-art.

After presenting the characteristics associated with the co-occurrence notion in a multidisciplinary context, the following section compares the methodological viewpoints to identify these elements according to the domains.

3.2 Ranking of co-occurrences

Co-occurrence identification by automatic systems is generally based on the use of quality measures and/or algorithms. This section provides two illustrative examples that show similarities between approaches according to the domains.

3.2.1 Mutual Information and Lift measure

First, the use of specific statistical measures from different domains is highlighted. This paragraph focuses on the study of Mutual Information (MI). This measure is often used in the NLP domain to measure the association between words [Church:1990:WAN:89086.89095]. MI (see formula (1)) compares the probability of observing $x$ and $y$ together (joint probability) with the probabilities of observing $x$ and $y$ independently (chance) [Church:1990:WAN:89086.89095].

$$MI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \qquad (1)$$
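As an illustration of this measure, the sketch below estimates MI for adjacent word pairs, with probabilities taken as counts normalized by the corpus size. The toy corpus, the function name `mutual_information` and the choice of ordered adjacent pairs as the joint event are assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(pair_count, x_count, y_count, n_words):
    """log2( P(x, y) / (P(x) * P(y)) ), probabilities estimated as counts / corpus size."""
    p_xy = pair_count / n_words
    p_x = x_count / n_words
    p_y = y_count / n_words
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical 12-word corpus.
tokens = "climate change affects the water cycle and climate change affects the climate".split()
words = Counter(tokens)
pairs = Counter(zip(tokens, tokens[1:]))  # ordered adjacent pairs (x followed by y)

def mi(x, y):
    return mutual_information(pairs[(x, y)], words[x], words[y], len(tokens))

print(round(mi("climate", "change"), 3))  # strongly associated pair
print(round(mi("the", "climate"), 3))     # weaker association
print(pairs[("climate", "change")], pairs[("change", "climate")])  # counts depend on order
```

Because the pairs are counted in order, "climate change" and "change climate" do not have the same number of occurrences in this corpus.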


In general, word probabilities $P(x)$ and $P(y)$ correspond to the number of observations of $x$ and $y$ in a corpus, normalized by the size of the corpus. Some extensions of MI are also proposed. The algorithm PMI-IR (Pointwise Mutual Information and Information Retrieval) described in [turney01mining] queries the Web via the AltaVista search engine to determine appropriate synonyms for a given query. For a given word, denoted $word$, PMI-IR chooses a synonym among a given list. These candidate terms, denoted $choice_i$, $i \in [1, n]$, correspond to TOEFL questions. The aim is to find the synonym that gives the best score. To obtain scores, PMI-IR uses several measures based on the proportion of documents where both terms are present. Turney's formula (2), given below, is one of the basic measures used in [turney01mining]; it is inspired by the MI described in [Church:1990:WAN:89086.89095].

$$score(choice_i) = \frac{nb(word \ \mathrm{NEAR} \ choice_i)}{nb(choice_i)} \qquad (2)$$

With formula (2), the proportion of documents containing both $word$ and $choice_i$ (within a word window) is calculated, and compared with the number of documents containing the word $choice_i$. The higher this proportion, the more $word$ and $choice_i$ are seen as synonyms.

  • $nb(x)$ computes the number of documents containing the word $x$ (i.e. $nb$ corresponds to the number of webpages returned by search engines),

  • NEAR (used in the 'advanced research' field of AltaVista) is an operator that identifies if two words are present in a 10-word wide window.

This kind of web mining approach is also used in many NLP applications, e.g. (i) computing the relationship between host and clinical sign for an epidemiology surveillance system [DBLP:journals/ijaeis/ArsevskaRHCFLD16], (ii) computing the dependency of words of acronym definitions for word-sense disambiguation tasks [DBLP:journals/informaticaSI/RocheP10].

The joint probabilities are generally symmetric (i.e. $P(x, y) = P(y, x)$), so the original MI measure is also symmetric. But the association ratio applied in the NLP domain is not symmetric, i.e. the occurrence numbers of the word pairs "x y" and "y x" generally differ. Moreover, the meaning and relevance of phrases may differ according to the word order in a text, e.g. first lady and lady first.

Finally, MI is very close to the lift measure [Brin:1997:BMB:253260.253327, Ventura2016, Azevedo:2007:CRM:1421665.1421715] in data mining. This measure identifies relevant association rules (see formula (3)). The lift measure evaluates the relevance of co-occurrences only (not implication) and how independent $X$ and $Y$ are [Azevedo:2007:CRM:1421665.1421715].

$$lift(X \rightarrow Y) = \frac{conf(X \rightarrow Y)}{sup(Y)} \qquad (3)$$

This measure is based on both confidence and support criteria, which in turn are based on association rule ($X \rightarrow Y$) identification. Support is an indication of how frequently the itemset appears in the dataset. Confidence is a standard measure that estimates the probability of observing $Y$ given $X$ (see formula (4)).

$$conf(X \rightarrow Y) = \frac{sup(X \cup Y)}{sup(X)} \qquad (4)$$
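Support, confidence and lift can be computed directly from a set of transactions. The following sketch uses the usual textbook definitions (relative support, confidence as a conditional frequency, lift as confidence divided by the support of the consequent); the function names and the toy transactions are assumptions made for the example.

```python
import math

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Estimated probability of observing Y in a transaction given that X is present."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

# Hypothetical transactions: each one is the set of words of a sentence.
transactions = [
    {"climate", "change", "water"},
    {"climate", "change"},
    {"water", "cycle"},
    {"climate", "water"},
]
x, y = {"climate"}, {"change"}
print(confidence(x, y, transactions), confidence(y, x, transactions))  # direction matters
print(lift(x, y, transactions), lift(y, x, transactions))              # same value both ways
```

In this example confidence depends on the direction of the rule whereas lift does not, which matches the remark that lift evaluates co-occurrence rather than implication.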


Note that other quality measures of the data mining domain, such as Least contradiction or Conviction [Lallich2007], could be tailored to deal with textual data.

3.2.2 C-value and closed itemset

A second example concerns methodological similarities between approaches. The C-value approach [Frantzi2000] used in the NLP domain [Lossio-Ventura:2016:BTE:2890328.2890347, JiangDTCX12] favors terms that do not appear to a significant extent in longer terms. For example, in a specialized corpus related to ophthalmology, [Frantzi2000] show that a more general term such as soft contact is irrelevant, whereas a longer and therefore more specific term such as soft contact lens is relevant. This kind of measure is particularly relevant in the biology domain [Lossio-Ventura:2016:BTE:2890328.2890347, JiangDTCX12].
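The behaviour described above can be illustrated with a small sketch of the C-value score. The formula used here is the commonly cited formulation (log2 of the term length multiplied by the term frequency, reduced by the mean frequency of the longer candidate terms that contain it); it is not given in this paper, and the frequencies below are hypothetical.

```python
import math

def is_nested(short, long):
    """True if the word sequence `short` occurs contiguously inside the longer `long`."""
    n = len(short)
    return n < len(long) and any(long[i:i + n] == short for i in range(len(long) - n + 1))

def c_value(term, freq):
    """C-value of `term`; `freq` maps candidate terms (tuples of words) to frequencies."""
    longer = [other for other in freq if is_nested(term, other)]
    score = freq[term]
    if longer:
        # Subtract the mean frequency of the longer candidates containing the term.
        score -= sum(freq[other] for other in longer) / len(longer)
    return math.log2(len(term)) * score

# Hypothetical frequencies: "soft contact" only ever occurs inside "soft contact lens".
freq = {
    ("soft", "contact"): 5,
    ("contact", "lens"): 8,
    ("soft", "contact", "lens"): 5,
}
for term in freq:
    print(" ".join(term), round(c_value(term, freq), 2))
```

With these counts, soft contact receives a score of zero because all of its occurrences are explained by the longer term, while soft contact lens keeps a positive score.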

In addition, in the computer science domain (i.e. data mining), the notion of closed itemset is very close to the C-value approach. In this context, a frequent itemset is considered as closed if none of its supersets has the same support (i.e. frequency). A superset is defined with respect to another itemset: for example, {M1, M2, M3} is a superset of {M1, M2}. More precisely, B is a superset of A if card(A) < card(B) and A ⊂ B.
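The definition of a closed itemset translates into a short filter over frequent itemsets. This is a naive quadratic sketch for illustration, not an efficient closed-itemset miner; the function name `closed_itemsets` and the support counts are assumptions chosen to mirror the soft contact lens example.

```python
def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with the same support.

    `frequent` maps frozensets of items to their support counts.
    """
    return {
        itemset: sup
        for itemset, sup in frequent.items()
        # `itemset < other` is the proper-subset test on frozensets.
        if not any(itemset < other and sup == other_sup
                   for other, other_sup in frequent.items())
    }

# Hypothetical support counts for words co-occurring in sentences.
frequent = {
    frozenset({"soft"}): 5,
    frozenset({"contact"}): 8,
    frozenset({"lens"}): 8,
    frozenset({"soft", "contact"}): 5,
    frozenset({"soft", "lens"}): 5,
    frozenset({"contact", "lens"}): 8,
    frozenset({"soft", "contact", "lens"}): 5,
}
closed = closed_itemsets(frequent)
for itemset, sup in closed.items():
    print(sorted(itemset), sup)
```

Here {soft, contact} is discarded because its superset {soft, contact, lens} has the same support, which parallels the way C-value penalizes soft contact in favor of soft contact lens.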

This section and both illustrative examples confirm the importance of having a real multidisciplinary viewpoint on the methodological aspects, in order to build scientific bridges and thus contribute to the development of the emerging data science domain.

4 Conclusion and Future Work

This position paper proposes a discussion on similarities as well as differences in the definition of co-occurrence according to research domains (i.e. linguistics, NLP, computer science). The aim of this position paper is to show the bridges that exist between different domains.

In addition, this paper highlights some similarities in the methodologies used to identify co-occurrences in different domains. We could extend the discussion to other domains. For example, methodological transfers are currently applied between bioinformatics and NLP, such as the use of edit measures (e.g. Levenshtein distance) for sequence alignment tasks (bioinformatics) vs. string comparison (NLP).


This work is funded by the SONGES project (Occitanie and FEDER) – Heterogeneous Data Science (