Intent Classification using Feature Sets for Domestic Violence Discourse on Social Media

02/21/2018
by   Sudha Subramani, et al.
Victoria University

Domestic Violence against women is now recognized as a serious and widespread problem worldwide. Domestic Violence and Abuse is at the root of many issues in society and is considered a societal taboo topic. Fortunately, with the popularity of social media, social welfare communities and victim support groups enable victims to share their stories of abuse and allow others to give advice and help. Hence, in order to offer immediate resources to those in need, the specific messages from victims need to be distinguished from other messages. In this paper, we regard intent mining as a binary classification problem (abuse or advice) with the use case of abuse discourse. To address this problem, we extract rich feature sets from the raw corpus, using psycholinguistic clues and textual features derived by a term-class interaction method. Machine learning algorithms are used to compare the classification accuracy obtained with the two different feature sets. Our experimental results, with high classification accuracy, offer a promising solution to understanding a big social problem through big social media data and to serving the information needs of various community welfare organizations.

I Introduction

Domestic Violence (DV) is a global issue of pandemic proportions, affecting people of every age group, culture, socioeconomic group, and education level. The World Health Organization estimates that 35% of women worldwide have experienced Intimate Partner Violence (IPV) [1]. IPV refers not only to physical abuse but also includes sexual, verbal, psychological and financial abuse [2]. DV has severe and persistent effects on physical health and also has a cumulative impact on women's mental health [3]. Hence, there is a pressing need to better characterize and understand DV, to provide appropriate resources for victims and to efficiently implement control measures.

People are increasingly using social media platforms such as Twitter and Facebook. They share their thoughts and opinions on daily activities and happenings on these sites in a naturalistic setting. This generates a massive amount of user data discussing various topics, even socially stigmatized contexts and taboo topics like DV. Victims experiencing abuse need early access to specialized DV services such as health care, crisis support and legal guidance. They may not have the option, or may be unaware of how, to seek help directly. On social media, many social welfare and non-profit organizations encourage victims and survivors of DV to express their feelings and share their experience of being in an abusive relationship. Hence these social support groups play a leading role in promoting awareness and leveraging various dimensions of social support, such as emotional, instrumental and informational support, for victims. When victims seek help, it is important for those groups to identify the critical posts and provide a clear call-to-response with more immediate impact. From this citizen-generated data, social welfare organizations with limited resources are trying to incorporate information nuggets to enrich their decision support systems [4].

Intent mining provides insights that are not explicitly available from user-generated data. Intent is defined as a purposeful action, and this can help to identify actionable information [5] [6]. Intent classification (focused on future action) is a form of text classification. In our work, we apply intent classification to classify user posts into one of two classes: abuse or advice. If the user shares their experience of an abusive relationship, the post is classified into the "abuse" class. If, instead, the post relates to awareness promotion, giving advice or expressing an opinion, it is classified into the "advice/opinion" class.

S.No | Post | Class
1 | To understand why people stay in abusive relationships. Visit the link. | advice/opinion
2 | If I could go back to the day that I was bickering with my first husband. Nothing serious just small disagreement. I was 8 months pregnant walked in the kitchen to get a drink and boom he was hiding and waiting. It took one punch to my head I fell backwards out unconscious in a pool of blood. | abuse
3 | Negative emotions like hatred destroy our peace of mind. Guide yourself to find peace and healing within you and other people. Hope we become better people to love and share the best in us. Bless. | advice/opinion
4 | Yesterday I lost a loving and dear sweet friend. Her life was cut short by the monster she once Loved and called her husband. | abuse
TABLE I: Example posts and their associated intent class

From Table I, we can clearly see that posts 2 and 4 describe stories about abusive relationships. Post 2 is shared by a DV survivor and post 4 by a victim's friend. Post 1 contains awareness promotion and post 3 expresses a thought or advice; hence, they are classified as opinion/advice. Thus, our research question is: "How to mine relevant social intent from ambiguous and proliferating unstructured textual data in abuse discourse?" By this notion, the relevant intent classes meet the actionable information needs of an organization in a given context [4]. There is no chance of helping the victim or the victim's family if the message goes unnoticed or is ignored. Thus, the paper focuses on identifying abuse-related posts and turning that knowledge into action that will help victims of abuse. The intent classification system in turn provides actionable knowledge that benefits society.

With the proliferation of unstructured data, text classification or text categorization has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. We observed two key challenges with intent classification on our DV discourse. First, informal language use in short-text messages creates ambiguity in interpreting user expressions, thus weakening term-class relationships. Second, sparsity of instances of specific intent classes in the corpus creates data imbalance. In binary classification, both intent classes may co-occur within a single message. For instance, post 1 in Table I may be classified into the class abuse instead of the class advice, as the text contains the keyword abuse. Hence, in our work, intent classification exploits a rich feature representation for learning, created using knowledge sources from psycholinguistic features as well as textual features from the underlying post. We extract texts relating to DV from Facebook because it allows for long text discussion, so the use of standard English is more common. Facebook allows users to comment on posts, providing them with the ability to share their stories about abusive life patterns, give advice, and provide support. The two contributions of this work are as follows:

  • Constructing two different and efficient feature sets by analysing the psycholinguistic dimensions and the textual features of user postings on Facebook.

  • Evaluating classifier performance for identifying such texts by constructing and comparing the two different feature sets from DV discourse.

Analysis of the linguistic structures embedded in these posts provides insight into how victims of domestic abuse report their personal stories and their need for emergency support. Trained classifiers agree with these linguistic structures, adding evidence that these social media texts provide valuable insights into DV. Intent mining classification can help to design efficient cooperative information systems between citizens and organizations, serving organizational information needs and helping victims in need.

The structure of the paper is as follows. In Section II, related work is discussed. Section III defines the problem statement, and the experimental analysis is discussed in Section IV. Section V discusses the feature extraction methods in detail, and Section VI explains the classifiers and evaluation metrics. Prediction results are discussed in Section VII. Finally, Section VIII presents the conclusion.

II Related Work

With the increasing popularity of social media, the amount of information now available at decisive moments is massive. The sheer volume of social media data makes it one of Big Data's most significant sources [7] [8]. Several research studies have focused on social media to analyse and predict real-world activities such as user sentiment analysis, opinion mining on political campaigns [9] [10], natural disasters [11], epidemic surveillance [12], event detection [13], tourism [14] [15], e-healthcare services [16] [17] and so on. On the other hand, some studies have dealt with security and privacy issues [18] [19] [20] [21] [22], as the increasing sophistication of social media data poses substantial threats to users' privacy. In contrast to traditional media, social media has become a fast-reaching data source for sharing opinions and thoughts online immediately with a status update. Hence, it has become an efficient platform for researchers to detect real-world events for information retrieval and decision making.

Recent studies have predicted mental health conditions by analysing textual features [23] [24] [25]. For this kind of prediction, two main characteristics, topics and language styles expressed via text, have been widely investigated. For identifying depression, linguistic styles such as expressions of sadness or the use of swear words have been used as cues [26].

Linguistic Inquiry and Word Count (LIWC) [27] has been commonly used to capture language characteristics, which are considered influential predictors of depression-related disorders and mental health [28] [29]. Popular Bayesian probabilistic modelling tools, such as Latent Dirichlet Allocation (LDA), are used to extract topics [30]. LDA and its variants have previously been used to discover several mental ailments discussed in millions of tweets [31]. The authors of [32] used standard n-gram features, submission length and author attributes to classify a Reddit submission as high- or low-level self-disclosure. Another study examined support seeking in the stigmatized context of sexual abuse on Reddit and investigated the use of throwaway accounts in the aspects of support seeking, anonymity and first-time disclosures [33]. Most relevantly, Schrading et al. [34] analyzed the lexical structures of tweets to predict whether a victim was staying in or leaving an abusive relationship.

Although the majority of studies assess the role of Twitter, Facebook, and Reddit for such new findings, there is limited research that focuses on online DV disclosures. Hence, this is the first study to conduct intent classification of Facebook content relating to abuse discourse.

III Problem Statement

Let D = {d_1, d_2, ..., d_n} denote the corpus of all Facebook posts, where each post d_i is a Facebook post generated in DV community pages, and C = {abuse, advice} is the set of binary classes.

For each given post d_i, predict an intent class c in C based on textual features extracted from the post: f(d_i) -> c.

To characterize the difference between two classes of DV dataset, two feature sets are extracted.

  • LIWC Features: Psycholinguistic knowledge based on LIWC [27] is used as a feature set. These features differentiate the semantic-syntactic patterns and informational context of the two classes, abuse and advice, and are used to train the classifiers for intent classification and improve accuracy.

  • Term based model: Feature sets are built from textual features derived from term-class interaction with the chi-square metric [24] to synthesize a more accurate classification procedure.

The model for the current study is illustrated in Fig.  1 and in the following sections, we describe the components of this model.

Fig. 1: Architecture of automated classification based on training feature sets

IV Experimental Analysis

Effective intent mining on social media data is demanding due to ambiguity in the interpretation of textual data and the sparsity of relevant behaviour [4]. In this paper, we address binary classification of intent with a use case of DV data generated on Facebook. This study uses machine learning approaches to discriminate online posts into two categories based on two different feature-set approaches. The first approach, explained in Section V-A, exploits psycholinguistic features generated with LIWC [27] and has the advantage of achieving higher accuracy while being computationally efficient. The other approach, explained in Section V-B, constructs the feature set based on bag-of-tokens [35], using the tf.idf approach combined with the chi-squared algorithm [36]. The classifier is then trained with higher accuracy based on the constructed feature set.

IV-A Data Collection

For the current study, we extracted data from Facebook pages focusing on specific themes such as DV, using NCapture [37]. The dataset contains 8856 posts and 28873 comments, and the extracted posts range from 2014 to 2017. The extracted data contains information about post content, user name, link description, comment text, commenter username, creation time, and likes. Table II summarises the dataset.

DV Page Name | Posts | Comments | Page Likes
Stop DV | 3597 | 13462 | 63k
DV Survivors | 2816 | 8415 | 32k
Stop DV against Women | 516 | 2802 | 7.5k
DV Awareness Month | 1927 | 4194 | 15k
TABLE II: Details of DV Facebook pages: name, posts, comments and likes (till date)

IV-B Extracting Gold Standard Labels

Two of the authors annotated a random sample of 1135 posts, labelling each as abuse or opinion/advice based on its textual content. 510 posts (45%) were assigned to the first class and 625 posts (55%) to the second class. To measure inter-rater reliability, the Kappa coefficient, a statistical measure of the degree of agreement between two raters, was used; the Kappa coefficient between the authors was 0.85, indicating good agreement. Any discrepancies between the authors were resolved after further discussion.
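As a minimal illustration (not the authors' code), the reported agreement can be computed with scikit-learn's Cohen's kappa, assuming the two annotators' labels are held in parallel lists; the lists below are placeholders:

```python
# Inter-rater agreement (Cohen's kappa) for two annotators -- illustrative sketch.
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel label lists for the same annotated posts.
annotator_1 = ["abuse", "advice", "advice", "abuse", "advice"]
annotator_2 = ["abuse", "advice", "abuse", "abuse", "advice"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # the paper reports 0.85 on the full 1135-post sample
```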

IV-C Data Pre-processing

For effective analysis, text pre-processing is the most important step, as it removes noise that produces negative effects and degrades performance. High-quality information and features are extracted by applying pre-processing techniques such as stop-word removal and normalization techniques such as stemming and lemmatization. First, we removed stop words, which are not included in the content analysis. Stop-word lists contain common English words such as articles, prepositions, conjunctions and pronouns; a few examples are the, a, an, in, and at. The next pre-processing step is lemmatization, which reduces inflectional forms of words to a more limited set of canonical forms. This helps to standardize the appearance of terms and to reduce data sparseness. For example, the terms physically assaulted, physically assaults and physically assaulting are all lemmatized to "physical assault". Similarly, physically abusing, physically abused and physically abuses are all lemmatized to "physical abuse".
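A minimal sketch of this pre-processing pipeline using NLTK for stop-word removal and lemmatization follows; the exact tools and phrase-level lemmatization rules used by the authors are not specified, so this is only an approximation:

```python
# Stop-word removal and lemmatization sketch (NLTK-based approximation).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(post: str) -> list[str]:
    """Lower-case, tokenize, drop stop words, and lemmatize a post."""
    tokens = word_tokenize(post.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    # Lemmatize verbs first, then nouns, so "assaulted"/"assaults" -> "assault".
    return [lemmatizer.lemmatize(lemmatizer.lemmatize(t, pos="v")) for t in tokens]

print(preprocess("He physically assaulted her in the kitchen"))
# -> ['physically', 'assault', 'kitchen']  (approximate output)
```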

V Feature Extraction

Feature extraction is one of the most important steps in data mining and knowledge discovery. The idea is to select the best features that improve classification accuracy. In the following Sections V-A and V-B, the two different feature extraction techniques used in this work are discussed.

V-A Methodology 1: Psycholinguistic Features Analysis

We examined and analysed the proportions of word usage in psycholinguistic categories as defined in the LIWC 2015 package [27]. LIWC analyses text on a word-by-word basis and calculates the percentage of words that match particular word categories.

For each given post d in the corpus, we predict an intent class c in C based on LIWC features extracted from the post, defined as L(d) = (l_1, l_2, ..., l_k). Here l_j denotes the quantity of a specific psycholinguistic feature in post d.

The LIWC package is a psycholinguistic lexicon created by psychologists with a focus on identifying the various emotional, cognitive, and linguistic components present in individuals' verbal or written communication. For each input post, it returns more than 70 output variables organised in a higher-level hierarchy of psycholinguistic features, such as

  • linguistic dimensions, other grammar, informal language

  • temporal, affective, social processes

  • cognitive, biological, perceptual processes

  • personal concerns, drives, relativity.

These higher-level categories are further specialized into sub-categories, such as

  • biological processes - body, sexual, health and ingestion.

  • affective processes - positive emotion and negative emotion, with negative emotion further sub-classified as anger, anxiety, and sadness.

  • drives - affiliation, achievement, power, reward, and risk.

For evaluating the prediction accuracy of the psycholinguistic features, each individual Facebook post is converted into a vector of 70 numerical output variables, as mentioned above. Each output variable represents the frequency of appearance of the corresponding category in the specific post. Each word in a post may fit some categories and not others; hence, posts differ widely in the categories to which they belong. For instance, the post "Please view, share and is possible donate. We appreciate your support!" has high values of 'positive emotion' (36.36%), 'focus present' (36.36%), 'you' (9.09%) and 'social' (27.27%), and 0% for categories such as 'negative emotion', 'shehe', 'bio' and 'body'. This post falls into the "advice/opinion" category, as it promotes a good social cause and fund-raising and has a high percentage of positive expression and present focus. In contrast, the post "He is just an evil, greedy, arrogant little man" has high percentages of 'negative emotion' (40%), 'anger' (10%), 'male' (20%) and 'shehe' (10%), and 0% for 'posemo', 'you' and 'death'. This post falls into the "abuse" class, as a woman is describing her abusive partner and the post carries a lot of negative emotion.
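LIWC itself is a commercial tool, so the sketch below only illustrates how its per-post output (assumed here to be exported as a CSV of category percentages with a gold "label" column) could be turned into feature vectors for the classifiers; the file name and column names are assumptions, not from the paper:

```python
# Building feature vectors from LIWC output (illustrative; assumes LIWC2015
# results were exported to a CSV with one row per post and a "label" column).
import pandas as pd

liwc = pd.read_csv("liwc_output.csv")  # hypothetical export path

SELECTED = ["i", "you", "shehe", "focuspast", "focuspresent", "focusfuture",
            "body", "sexual", "health", "death", "posemo", "negemo",
            "anx", "anger", "sad"]  # the 15 categories analysed in the paper

X = liwc[SELECTED].values   # feature matrix: percentage of words per category
y = liwc["label"].values    # "abuse" or "advice"
print(X.shape, y[:5])
```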

V-A1 Most informative features and analysis

We selected only the 15 most informative features, as shown in Table III, to perform the binary classification task. We selected these features based on their mean values across all posts, as reported in Table IV. We selected feature sets that are strong predictors of the two classes, "advice/opinion" and "abuse".

LIWC sub-categories such as 'negative emotion', 'anger', 'shehe' and 'focuspast' are features with higher mean values and good predictive power for the "abuse" category. The 'positive emotion', 'focus present', 'you' and 'focus future' sub-categories, as expected, are good predictors for the "opinion/advice" class. Another important observation is that 'health', 'sexual' and the personal concern 'death' are also good predictors for the "abuse" class. The results suggest that, because of the abusive cycle, many victims suffer from severe health issues and even death. Further analysis shows that posts in the "abuse" category are often self-reflective, with more personal pronouns such as 'I' and 'shehe' used when describing their life experience of violence, whereas in the "opinion/advice" category the use of 'you' is higher when giving advice or sharing opinions with other people. Comparing time orientations, the posts of the "abuse" category are more focused on the 'past' and contain negative emotions with expressions of 'anxiety', 'anger' and 'sadness'. On the other hand, the "advice" category contains more positive emotion, as it shares good thoughts and opinions, and is more time-oriented towards the 'present' and 'future'.

Category | Dimension | Example words
Linguistic Dimensions | personal pronouns (I, you, shehe) | I, you, he, she, his, him, her, herself
Time orientations | focuspast | broke, ran, accepted
Time orientations | focuspresent | supports, trust, likes
Time orientations | focusfuture | plan, wish, hopeful
Biological Processes | body | muscles, injury, fat
Biological Processes | sexual | rape, lust, abortion, pregnant
Biological Processes | health | sick, weak, painful, bleed
Psychological Processes | posemo | hope, share, support, like
Psychological Processes | negemo | threat, lose, hate
Psychological Processes | anxiety | threat, misery, worry
Psychological Processes | anger | sucks, hate, yell
Psychological Processes | sad | miss, lose, suffer, overwhelm
Personal Concern | death | die, murder, kill, suicide, bury
TABLE III: LIWC features and the example words used in our dataset

The posts in the "abuse" category have higher mean values for features such as 'I', 'shehe', 'focuspast', 'body', 'sexual', 'health', 'death' and 'negative emotions'. The posts in the "opinion/advice" category score high in features such as 'you', 'focuspresent', 'focusfuture' and 'posemo'. For example, for the 'shehe' feature, the posts belonging to the "abuse" category have the highest mean value of 10.98, whereas the posts of the "opinion/advice" category have the lowest mean value of 0.39. This implies that when victims post their stories, they say more about the abuser and thus use the third-person pronoun 'shehe' more often. In the "opinion/advice" class, people simply express their thoughts and do not necessarily use third-person pronouns.

Feature | Abuse (mean) | Advice (mean)
I | 3.15 | 2.02
You | 0.59 | 4.01
Shehe | 10.98 | 0.39
Focuspast | 7.09 | 0.10
Focuspresent | 7.11 | 14.34
Focusfuture | 0.96 | 1.55
Body | 0.84 | 0.42
Sexual | 0.34 | 0.09
Health | 1.01 | 0.61
Death | 1.19 | 0.02
Posemo | 2.57 | 10.73
Negemo | 4.78 | 1.55
Anxiety | 0.65 | 0.09
Anger | 2.28 | 0.57
Sad | 0.58 | 0.14
TABLE IV: Mean scores of psycholinguistic processes (LIWC) for the posts of the two categories
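The per-class means in Table IV, and the mean-based feature selection described above, could be reproduced along the following lines, assuming the same hypothetical LIWC export used in the earlier sketch:

```python
# Per-class mean LIWC scores (cf. Table IV); assumes the hypothetical
# "liwc_output.csv" with one row per post and a "label" column.
import pandas as pd

liwc = pd.read_csv("liwc_output.csv")
categories = ["i", "you", "shehe", "focuspast", "focuspresent", "focusfuture",
              "body", "sexual", "health", "death", "posemo", "negemo",
              "anx", "anger", "sad"]

# Rows: LIWC categories, columns: mean score within each class.
class_means = liwc.groupby("label")[categories].mean().T
print(class_means)

# Categories with the largest gap between the two class means are taken
# as the most informative features.
gap = (class_means["abuse"] - class_means["advice"]).abs().sort_values(ascending=False)
print(gap)
```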

V-B Methodology 2: tf.idf approach with chi-squared metric (χ² statistic) as feature selection parameter

  • In the term-document matrix of dimension n × m, n indicates the total number of posts and m indicates the number of terms. Two common features, tf(t, d) and df(t), are used, where tf(t, d) represents the number of times the term t appears in post d and df(t) denotes the number of posts containing term t. The tf.idf weighting scheme improves discriminative power, where tf.idf(t, d) = tf(t, d) × idf(t), with idf(t) = log(N / df(t)) the inverse document frequency. In this work, a term t is selected if it has high tf and idf values, and thus a high tf.idf value, averaged across all Facebook posts, above a threshold. A term-class interaction based selection method is also used, which captures the dependence between terms and the corresponding class labels during the feature selection process.

  • A feature selection technique based on the chi-squared statistic over the entire term-document matrix is used to compute a chi-squared value for each word; a minimal sketch of this scoring follows the list.

    The chi-square statistical test is widely accepted as a statistical hypothesis test to evaluate the dependency between two variables [24]. In natural language processing, chi-squared is generally used to measure the degree of dependence between a term t and a label l, and is compared against the χ² distribution with one degree of freedom. The χ² statistic is defined as

    χ²(t, l) = N (AD − CB)² / [(A + C)(B + D)(A + B)(C + D)]    (1)

    where
    N = the total number of posts,
    A = the number of posts of label l containing term t,
    B = the number of posts containing t occurring without l,
    C = the number of posts of label l occurring without t,
    D = the number of posts of other classes without t.
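A minimal sketch of this term-class χ² scoring with scikit-learn, under the assumption that the pre-processed posts and gold labels are available (the two-post corpus below is only a placeholder):

```python
# Term-class chi-squared scoring sketch (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

posts = ["he hit me and i had to leave", "please share and support this page"]  # placeholder corpus
labels = ["abuse", "advice"]                                                    # gold labels

vectorizer = TfidfVectorizer(max_features=300)   # 300 tf.idf features, as in the paper
X = vectorizer.fit_transform(posts)

chi2_scores, p_values = chi2(X, labels)          # one chi-squared value per term

# Rank terms by their chi-squared value (cf. Table VI).
terms = vectorizer.get_feature_names_out()
ranked = sorted(zip(terms, chi2_scores), key=lambda x: x[1], reverse=True)
print(ranked[:10])
```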

VI Classifiers and Evaluation Metrics

We now pursue the use of supervised learning to construct classifiers trained to predict the class label of the posts. Although we analysed results for both the all-dimension-inclusive and dimension-reduced cases, we employ principal component analysis (PCA) to avoid over-fitting for all classifiers. We compare several different classifiers, namely Support Vector Machine, k-Nearest Neighbor, Naive Bayes, and Decision Tree, to empirically determine the best-suited classification technique. The problem of binary text classification is generally defined as follows:

Our training set is a corpus of posts D = {(d_1, c_1), ..., (d_n, c_n)}, such that each post d_i is labelled with a class c_i of either abuse or advice. The task of a classifier f is to find the corresponding label for each post:

f(d_i) = c_i, with c_i in {abuse, advice}    (2)

For all of our analyses, we use 10-fold cross-validation and leave-one-out methods to assess the effectiveness of the model.

  • Support Vector Machine (SVM): a non-probabilistic, generalized linear binary classifier [38]. It maps an input feature vector into a higher-dimensional space and finds a hyperplane that separates the data into two classes with the maximal margin between the closest samples in each class.


  • Naive Bayes (NB): a probabilistic method [39] for text classification known for its robustness and relative simplicity. This classifier constructs the conditional probability distributions of the underlying features given a class label from the training data only. Classification of unseen data is then performed by comparing the class likelihoods [40] [41].

  • Decision Tree (DT): an interpretable classifier [42] that creates a hierarchical tree of the training instances, in which a condition on a feature value is used to divide the data hierarchically. For the classification of text documents, the conditions on decision tree nodes are commonly defined in terms of terms, and a node may be subdivided into its children based on the presence or absence of a particular term in the document. Ensemble methods use multiple decision tree learners for better predictive performance [43] [44] [45].

  • k-Nearest Neighbor (k-NN): a proximity-based classifier [46] that uses distance-based measures, i.e., documents belonging to the same class are more likely to be similar or close to each other according to the similarity measures. The classification of a test document is derived from the class labels of the k most similar documents in the training set.

We performed various runs with different feature sets in the classifiers defined above. We used the common metrics Precision, Recall, F-Measure, and Accuracy to evaluate classification performance. Precision measures the percentage of Facebook posts that the classifier labeled as positive that are in fact positive (i.e., positive according to the human gold labels). Recall measures the percentage of posts actually positive in the gold labels that were correctly identified by the classifier. F-measure is the weighted harmonic mean of precision and recall. Accuracy calculates the percentage of correctly classified posts over the total number of posts. The metrics are defined as follows.

Precision = TP / (TP + FP)    (3)
Recall = TP / (TP + FN)    (4)
F-Measure = 2 × Precision × Recall / (Precision + Recall)    (5)
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (6)

where TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively.
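A sketch of this classifier comparison under 10-fold cross-validation with the metrics above, using scikit-learn; the feature matrix X and labels y are assumed to come from either feature-set construction in Section V, and the random data below is only a placeholder:

```python
# Comparing SVM, Naive Bayes, Decision Tree and k-NN with 10-fold CV
# (illustrative sketch; X, y should come from the LIWC or tf.idf feature sets).
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate

X = np.random.rand(1135, 15)                      # placeholder feature matrix
y = np.random.choice(["abuse", "advice"], 1135)   # placeholder labels

classifiers = {
    "SVM": SVC(kernel="linear"),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}
scoring = ["precision_macro", "recall_macro", "f1_macro", "accuracy"]

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})
```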

VII Prediction Results and Discussion

VII-A LIWC Analysis

Our chosen most informative LIWC features achieve high accuracy in predicting the two classes, abuse and advice. Among all the classifiers, SVM outperforms the others. The classification accuracies of SVM, kNN and decision tree are 97%, 95.3% and 95.1%, respectively.

Table V and Fig. 2 show the evaluation metrics of the SVM classifier for various combinations of the selected features. The higher the accuracy, the better the selected features predict the classes of our model.

Feature set (SVM classifier) | Precision (%) | Recall (%) | F-Measure (%) | Accuracy (%)
Linguistic Dimensions (3 features) | 96 | 91 | 93 | 94
Time orientations (3 features) | 98 | 86 | 91 | 92
Biological Processes + personal concern (4 features) | 89 | 42 | 57 | 68
Psychological Processes (5 features) | 83 | 86 | 84 | 84
Selected LIWC features (15 features) | 97 | 96 | 96 | 97
TABLE V: Performance metrics of the SVM classifier for various combinations of LIWC features
Fig. 2: SVM classifier’s performance metrics on different feature sets

The parallel coordinates plot in Fig. 3 shows the selected features that best separate the class values. For example, 'posemo', 'focuspresent' and 'you' are the features that best identify the "advice/opinion" class, plotted in blue. The orange plot characterises the "abuse" class, with selective features such as 'shehe', 'focuspast' and 'death'. We can see that when a victim or survivor posts about an abusive experience, they use more past tense and more health-related concern; the 'shehe' pronoun is also widely used to represent the abuser. In the advice/opinion category, the linguistic style contains present and future tense, as it is more focused on future life and well-being.

Fig. 3: Parallel coordinates plot for the abuse and advice classes

VII-B Term-Class Interaction Model

Our final model starts from 300 features based on the tf.idf vector, and the number of features is then reduced to the top 250 based on the chi-squared value; hence the feature space is significantly reduced. Standard classifiers, Naive Bayes and k-NN, are then applied to classify the posts into the corresponding class, and the results of the classifiers are compared with respect to the standard evaluation metrics precision, recall and accuracy. Between leave-one-out and k-fold cross-validation, 10-fold cross-validation is used in the final model for better evaluation of predictive accuracy. Our results in Fig. 4 show that the classification accuracy of NB is 82% with 10-fold cross-validation, which outperforms the classification accuracy of kNN.

Fig. 4: Precision, Recall and Classification accuracy of naive bayes and kNN (** 10 folds and * leave one out)
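One way to assemble the final term-based model described above is as a scikit-learn pipeline; this is a sketch under the assumption that the cleaned posts and gold labels are passed in as lists of strings, not the authors' implementation:

```python
# Final term-based model sketch: tf.idf (300 features) -> chi-squared selection
# (top 250) -> classifier, evaluated with 10-fold cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate_term_model(posts, labels):
    """posts: pre-processed post strings; labels: "abuse"/"advice" gold annotations."""
    for name, clf in [("NB", MultinomialNB()), ("kNN", KNeighborsClassifier())]:
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(max_features=300)),
            ("chi2", SelectKBest(chi2, k=250)),
            ("clf", clf),
        ])
        accuracy = cross_val_score(pipe, posts, labels, cv=10, scoring="accuracy")
        print(name, round(accuracy.mean(), 3))
```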

The term-class interaction values from the chi-squared test are shown in Table VI. We can clearly see from Table VI that terms such as share, support, page and thank are highly associated with the class "opinion/advice", whereas terms such as kill, husband, murder and leave are more dependent on the class "abuse".

Term | Chi-squared value | Predicted class
Leave | 128.88 | Abuse
Kill | 127.12 | Abuse
Husband | 98.63 | Abuse
Abuse | 90.15 | Abuse
Lose | 75.01 | Abuse
Break | 63.02 | Abuse
Murder | 61.5 | Abuse
Share | 21.09 | Opinion
Thank | 20.99 | Opinion
Support | 16.62 | Opinion
TABLE VI: The selected terms with their chi-squared values and predicted classes

Utilizing the property of the χ² statistic, a higher χ² value for a term t indicates a higher likelihood of occurrence in the corresponding class c. Thus we use the χ² metric to weight the context words in the tf.idf model. The key aspect is that words with higher χ² statistics tend to be keywords for class identification. Hence we applied the chi-square statistical test to select the lexicon that particularly correlates with the specific class identification task for user posts. In this work, words that are likely to be valuable for the classification task are weighted more heavily based on the χ² metric, thereby reducing the disturbance of noise words that are comparatively unhelpful for the later task. Fig. 5 shows each term's probability of predicting the corresponding class. For example, words such as share and support belong to the class "advice/opinion" (represented as 0), while words such as kill, murder and leave predict the text to be in the abuse class (represented as 1).

Fig. 5: Visualization graph of the term-class interaction based feature selection method
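One possible reading of the χ²-based weighting described above is to scale each tf.idf column by its (normalized) χ² score so that class-indicative words carry more weight; the sketch below illustrates that interpretation and is not necessarily the authors' exact implementation:

```python
# Re-weighting tf.idf columns by their chi-squared scores (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

def chi2_weighted_tfidf(posts, labels):
    """Return a dense tf.idf matrix whose columns are scaled by normalized chi2 scores."""
    vectorizer = TfidfVectorizer(max_features=300)
    X = vectorizer.fit_transform(posts)
    scores, _ = chi2(X, labels)            # one chi-squared score per term
    weights = scores / scores.max()        # normalize so the top keyword has weight 1
    return X.toarray() * weights, vectorizer
```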

VIII Conclusion

The results of this study demonstrate that the linguistic dimensions and textual features in user posts have the potential to classify the text into the appropriate class. The experimental results highlight that psycholinguistic clues have stronger indicative power for predicting the class of posts than textual features. By applying the proposed intent mining classification models, social support groups on Facebook can quickly identify DV victims via the text they post, and appropriate support can be provided.

References

  • [1] C. García-Moreno, C. Pallitto, K. Devries, H. Stöckl, C. Watts, and N. Abrahams, Global and regional estimates of violence against women: prevalence and health effects of intimate partner violence and non-partner sexual violence.   World Health Organization, 2013.
  • [2] W. H. Organization et al., “Understanding and addressing violence against women: Intimate partner violence,” 2012.
  • [3] L. M. Bromfield, A. Lamont, R. Parker, B. Horsfall et al., “Issues for the safety and wellbeing of children in families with multiple and complex problems,” 2010.
  • [4] H. Purohit, G. Dong, V. Shalin, K. Thirunarayan, and A. Sheth, “Intent classification of short-text on social media,” in Smart City/SocialCom/SustainCom (SmartCity), 2015 IEEE International Conference on.   IEEE, 2015, pp. 222–228.
  • [5] I. Varga, M. Sano, K. Torisawa, C. Hashimoto, K. Ohtake, T. Kawai, J.-H. Oh, and S. De Saeger, “Aid is out there: Looking for help from tweets during a large scale disaster.” in ACL (1), 2013, pp. 1619–1629.
  • [6] H. Purohit, C. Castillo, F. Diaz, A. Sheth, and P. Meier, “Emergency-relief coordination on social media: Automatically matching resource requests and offers,” First Monday, vol. 19, no. 1, 2013.
  • [7] H. Wang, X. Jiang, and G. Kambourakis, “Special issue on security, privacy and trust in network-based big data,” Information Sciences: an International Journal, vol. 318, no. C, pp. 48–50, 2015.
  • [8] Y. Qin, Q. Z. Sheng, N. J. Falkner, S. Dustdar, H. Wang, and A. V. Vasilakos, “When things matter: A survey on data-centric internet of things,” Journal of Network and Computer Applications, vol. 64, pp. 137–153, 2016.
  • [9] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, “Predicting elections with twitter: What 140 characters reveal about political sentiment.” Icwsm, vol. 10, no. 1, pp. 178–185, 2010.
  • [10] B. O’Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, “From tweets to polls: Linking text sentiment to public opinion time series.” ICWSM, vol. 11, no. 122-129, pp. 1–2, 2010.
  • [11] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of the 19th international conference on World wide web.   ACM, 2010, pp. 851–860.
  • [12] R. Chunara, J. R. Andrews, and J. S. Brownstein, “Social and news media enable estimation of epidemiological patterns early in the 2010 haitian cholera outbreak,” The American journal of tropical medicine and hygiene, vol. 86, no. 1, pp. 39–45, 2012.
  • [13] S. Petrović, M. Osborne, and V. Lavrenko, “Streaming first story detection with application to twitter,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.   Association for Computational Linguistics, 2010, pp. 181–189.
  • [14] H. Q. Vu, G. Li, R. Law, and B. H. Ye, “Exploring the travel behaviors of inbound tourists to hong kong using geotagged photos,” Tourism Management, vol. 46, pp. 222–232, 2015.
  • [15] H. Q. Vu, G. Li, R. Law, and Y. Zhang, “Travel diaries analysis by sequential rule mining,” Journal of Travel Research, p. 0047287517692446, 2017.
  • [16] L. Sun, H. Wang, J. Soar, and C. Rong, “Purpose based access control for privacy protection in e-healthcare services,” Journal of Software, vol. 7, no. 11, pp. 2443–2449, 2012.
  • [17] L. Sun, H. Wang, J. Yong, and G. Wu, “Semantic access control for cloud computing based on e-healthcare,” in Computer Supported Cooperative Work in Design (CSCWD), 2012 IEEE 16th International Conference on.   IEEE, 2012, pp. 512–518.
  • [18] H. Wang, Z. Zhang, and T. Taleb, “Special issue on security and privacy of iot,” World Wide Web, pp. 1–6, 2017.
  • [19] X. Sun, H. Wang, J. Li, and Y. Zhang, “Satisfying privacy requirements before data anonymization,” The Computer Journal, vol. 55, no. 4, pp. 422–437, 2012.
  • [20] J. Li, H. Wang, H. Jin, and J. Yong, “Current developments of k-anonymous data releasing,” Electronic Journal of Health Informatics, vol. 3, no. 1, p. 6, 2008.
  • [21] H. Wang, J. Cao, and Y. Zhang, “A flexible payment scheme and its role-based access control,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 425–436, 2005.
  • [22] H. Wang and L. Sun, “Trust-involved access control in collaborative open social networks,” in Network and System Security (NSS), 2010 4th International Conference on.   IEEE, 2010, pp. 239–246.
  • [23] T. Nguyen, T. Duong, S. Venkatesh, and D. Phung, “Autism blogs: Expressed emotion, language styles and concerns in personal and community settings,” IEEE Transactions on Affective Computing, vol. 6, no. 3, pp. 312–323, 2015.
  • [24] T. Nguyen, D. Phung, B. Dao, S. Venkatesh, and M. Berk, “Affective and content analysis of online depression communities,” IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 217–226, 2014.
  • [25] T. Nguyen, B. O’Dea, M. Larsen, D. Phung, S. Venkatesh, and H. Christensen, “Using linguistic and topic analysis to classify sub-groups of online depression communities,” Multimedia Tools and Applications, vol. 76, no. 8, pp. 10 653–10 676, 2017.
  • [26] A. J. Rodriguez, S. E. Holleran, and M. R. Mehl, “Reading between the lines: The lay assessment of subclinical depression from written self-descriptions,” Journal of Personality, vol. 78, no. 2, pp. 575–598, 2010.
  • [27] J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn, “The development and psychometric properties of liwc2015,” Tech. Rep., 2015.
  • [28] N. Ramirez-Esparza, C. K. Chung, E. Kacewicz, and J. W. Pennebaker, “The psychology of word use in depression forums in english and in spanish: Texting two text analytic approaches.” in ICWSM, 2008.
  • [29] S. W. Stirman and J. W. Pennebaker, “Word use in the poetry of suicidal and nonsuicidal poets,” Psychosomatic medicine, vol. 63, no. 4, pp. 517–522, 2001.
  • [30] J. Huang, M. Peng, H. Wang, J. Cao, W. Gao, and X. Zhang, “A probabilistic method for emerging topic tracking in microblog stream,” World Wide Web, vol. 20, no. 2, pp. 325–350, 2017.
  • [31] M. J. Paul and M. Dredze, “Discovering health topics in social media using topic models,” PloS one, vol. 9, no. 8, p. e103408, 2014.
  • [32] S. Balani and M. De Choudhury, “Detecting and characterizing mental health related self-disclosure in social media,” in Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems.   ACM, 2015, pp. 1373–1378.
  • [33] N. Andalibi, O. L. Haimson, M. De Choudhury, and A. Forte, “Understanding social media disclosures of sexual abuse through the lenses of support seeking and anonymity,” in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems.   ACM, 2016, pp. 3906–3918.
  • [34] N. Schrading, C. O. Alm, R. Ptucha, and C. Homan, “# whyistayed,# whyileft: Microblogging to make sense of domestic abuse,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1281–1286.
  • [35] K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, vol. 28, no. 1, pp. 11–21, 1972.
  • [36] P. Legendre and L. F. Legendre, Numerical ecology.   Elsevier, 2012, vol. 24.
  • [37] QSR International, “Ncapture,” 2012. [Online]. Available: https://qsrinternational.com
  • [38] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [39] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using em,” Machine learning, vol. 39, no. 2, pp. 103–134, 2000.
  • [40] H. Wang, Y. Zhang et al., “Detection of motor imagery eeg signals employing naïve bayes based learning process,” Measurement, vol. 86, pp. 148–158, 2016.
  • [41] X. Yi and Y. Zhang, “Privacy-preserving naive bayes classification on distributed data via semi-trusted mixers,” Information systems, vol. 34, no. 3, pp. 371–380, 2009.
  • [42] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106, 1986.
  • [43] H. Hu, J. Li, H. Wang, G. Daggard, and M. Shi, “A maximally diversified multiple decision tree algorithm for microarray data classification,” in Proceedings of the 2006 workshop on Intelligent systems for bioinformatics-Volume 73.   Australian Computer Society, Inc., 2006, pp. 35–38.
  • [44] Y. Wang, H. Li, H. Wang, B. Zhou, and Y. Zhang, “Multi-window based ensemble learning for classification of imbalanced streaming data,” in International Conference on Web Information Systems Engineering.   Springer, 2015, pp. 78–92.
  • [45] S. Subramani, H. Wang, S. Balasubramaniam, R. Zhou, J. Ma, Y. Zhang, F. Whittaker, Y. Zhao, and S. Rangarajan, “Mining actionable knowledge using reordering based diversified actionable decision trees,” in International Conference on Web Information Systems Engineering.   Springer, 2016, pp. 553–560.
  • [46] E.-H. S. Han, G. Karypis, and V. Kumar, “Text categorization using weight adjusted k-nearest neighbor classification,” in Pacific-asia conference on knowledge discovery and data mining.   Springer, 2001, pp. 53–65.