India has a linguistically diverse population owing to its long history of foreign contact. English, one of the languages borrowed through this contact, became an integral part of the education system and gave rise to a population that is comfortable using two languages in day-to-day communication. Because of this diversity of languages and dialects, code-mixing occurs frequently in conversation, and the emergence of social media has made the practice even more widespread. The phenomenon is so common that it is often considered a different (emerging) variety of the language, e.g., Benglish (Bengali-English) and Hinglish (Hindi-English).
This phenomenon poses a great challenge to existing areas of Natural Language Processing (NLP) such as Sentiment Analysis, primarily because language technologies such as parsing and Part-of-Speech (POS) tagging are built mainly for English. Furthermore, labeled/annotated data of this kind is hard to come by, which leads to poor performance when straightforward machine learning algorithms are applied.
In this work, we participated in the SemEval-2020 Sentimix Task (https://competitions.codalab.org/competitions/20654) and attempted to solve the task of sentiment analysis of English-Hindi code-mixed sentences.
Our approach first applies feature extraction algorithms to the data provided by the organizers. We then used Support Vector Regression coupled with the Grid Search algorithm to assign each code-mixed sentence to its sentiment class. When evaluated with the metric prepared by the organizers, this approach achieved an F1 score of 66.2%.
The rest of the paper is organized as follows. Section 2 briefly quantifies the English-Hindi code-mixed data provided by the organizers of the task. Section 3 describes our proposed approach. This is followed by the results in Section 4 and the concluding remarks in Section 5.
The English-Hindi code-mixed data used to train our model was collected from Twitter using the Twitter API, by searching for keywords and constructions that are often included in offensive messages. The sentiment labels are positive, negative, and neutral. Besides the sentiment labels, language labels were also provided for every word of each code-mixed sentence. The word-level language tags were ENG (English), HIN (Hindi), and O (Other) for symbols, mentions, and hashtags.
The organizers provided a trial and a training data set; after combining both, we obtained 17,000 code-mixed instances. We further divided this data into two parts: (i) 15,000 instances as training data and (ii) 2,000 instances as validation data.
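As a minimal sketch of the split described above, the following holds out 2,000 of the combined instances for validation. Whether the original split was random is not stated, so the shuffling, the seed, and the function name `split_train_validation` are assumptions made for illustration only.

```python
import random


def split_train_validation(instances, n_validation=2000, seed=0):
    """Shuffle the instances and hold out n_validation of them (assumed random split)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation
```

With 17,000 combined instances and the default `n_validation`, this yields 15,000 training and 2,000 validation instances.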
Our approach converts the given tweets into sequences of words and then runs the Grid Search Cross-Validation algorithm on the processed tweets. Initially, the tweets were pre-processed, following earlier work, with steps that included the following:
Contracting white space
Extracting words from hashtags
The last step takes advantage of the Pascal casing of hashtags (e.g., #CoronaVirus): a simple regular expression can extract the individual words. This extraction improves performance mainly because words in hashtags may, to some extent, convey sentiments of hate, so they play an important role during the model-training stage.
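The two steps listed above can be sketched as follows. The exact regular expression used in our pipeline is not reproduced here; the pattern and the helper names `split_hashtag`, `expand_hashtags`, and `clean_tweet` are illustrative assumptions.

```python
import re
from typing import List

# Assumed pattern: an acronym run followed by a capitalised word ("NLP" in
# "NLPTask"), a capitalised or lowercase word, an all-caps run, or digits.
_HASHTAG_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_hashtag(hashtag: str) -> List[str]:
    """Split a Pascal-cased hashtag such as '#CoronaVirus' into its words."""
    return _HASHTAG_WORD.findall(hashtag.lstrip("#"))


def expand_hashtags(tweet: str) -> str:
    """Replace every hashtag in a tweet with its space-separated words."""
    def _expand(match):
        words = split_hashtag(match.group())
        # Fall back to the bare tag text if the pattern finds no words.
        return " ".join(words) if words else match.group().lstrip("#")
    return re.sub(r"#\w+", _expand, tweet)


def clean_tweet(tweet: str) -> str:
    """Expand hashtags, then contract runs of white space to a single space."""
    return re.sub(r"\s+", " ", expand_hashtags(tweet)).strip()
```

For instance, `clean_tweet("Stay   safe #CoronaVirus")` gives `"Stay safe Corona Virus"`.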
3.1 Feature Extraction
After obtaining clean tweets, we extracted various features by treating each tweet as a sequence of words. Some features were extracted manually, while others were extracted using pre-existing methods such as the Bag-of-Words model and GloVe vectors. Since our aim is sentiment analysis of the texts, the presence of hate, offense, humor, etc., may have a great influence on the result. The extracted features are listed below.
TF-IDF Vector features: The TF-IDF feature vectors for the texts as a sequence of word vectors.
GloVe Vector features: GloVe vector embeddings for the texts as a sequence of word embeddings.
Humor label and score: Whether a text is humorous or not and, if it is humorous, its score in the range 0-1.
Wordwise sentiment values: List of sentiment values of each word of the text.
Hate and offensiveness labels: Whether the text is offensive or not and if it constitutes hate speech.
Frequency of easy and difficult words: Included as a semantic feature for the texts. 
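As an illustration of the first feature in the list, TF-IDF vectors can be computed as sketched below. The use of scikit-learn's `TfidfVectorizer`, the n-gram range, and the example tweets are assumptions for this sketch and are not taken from the task data or from our exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned code-mixed tweets (not from the task data).
tweets = [
    "yeh movie bahut achhi thi loved it",
    "kya bakwas movie thi total waste",
    "aaj match hai india vs australia",
]

# Word-level TF-IDF; the unigram/bigram range is an illustrative choice.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)

# Sparse matrix with one row per tweet and one column per term.
tfidf_features = vectorizer.fit_transform(tweets)
```

The resulting matrix can be concatenated with the other features in the list before being passed to the learning model.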
3.2 Learning Model
Grid search tunes hyperparameters by exhaustively evaluating a specified set of candidate values to determine the best ones for a given model. This matters greatly because the performance of the entire model is highly dependent on the hyperparameter values specified.
The estimator parameter of the Grid Search Cross-Validation process takes the model whose hyperparameters are to be tuned. Here, the estimator is the Support Vector Regression (SVR) model, used with the linear and the RBF kernels.
This process requires certain parameters to be supplied manually. The param_grid parameter in turn requires a list of parameters and the range of values for each parameter of the specified estimator. The flow diagram is shown in Fig. 1.
The SVR was fed with parameter values of
Class weight and degree were set to Ellipsis.
The most significant parameters when working with the RBF kernel of the SVR model were "C", "gamma" and "epsilon". Each hyperparameter of the model was given a list of candidate values to choose from.
For the GridSearchCV algorithm, parameters like error_score, iid, param_grid, pre_dispatch, refit, return_train_score, scoring, and verbose were set to Ellipsis.
A cross-validation process is performed to determine the set of hyperparameter values that provides the best F1 score. The parameters for hyper-parameter selection are as follows:
We experimented extensively and adopted the parameters that gave the best results.
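The procedure described in this section can be sketched with scikit-learn's `GridSearchCV` and `SVR`. This is a minimal sketch under stated assumptions: the synthetic data, the encoding of the three classes as -1/0/1, the candidate values in `param_grid`, the number of folds, and the helper `rounded_macro_f1` (which rounds the regression output to the nearest class before scoring) are all illustrative and are not the settings used for the submitted system.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the extracted features and sentiment targets.
# Encoding the classes as -1 / 0 / 1 is an assumption of this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.clip(np.rint(X[:, 0]), -1, 1)


def rounded_macro_f1(y_true, y_pred):
    """Round SVR outputs to the nearest class, then compute macro F1."""
    labels = np.clip(np.rint(y_pred), -1, 1).astype(int)
    return f1_score(np.asarray(y_true).astype(int), labels, average="macro")


# Candidate values are illustrative, not the grid used in the paper.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10], "epsilon": [0.1, 0.5]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1],
     "epsilon": [0.1, 0.5]},
]

search = GridSearchCV(
    estimator=SVR(),
    param_grid=param_grid,
    scoring=make_scorer(rounded_macro_f1),
    cv=3,  # number of folds is an assumption
)
search.fit(X, y)

best_params = search.best_params_
# Map continuous predictions back to class labels.
predicted_labels = np.clip(np.rint(search.predict(X)), -1, 1).astype(int)
```

Because SVR produces continuous values, some mapping back to discrete classes is needed before an F1 score can be computed; rounding to the nearest encoded label is one simple option.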
The participating systems were evaluated as follows. The organizers used the F1 score averaged across the positive, negative, and neutral classes, and the final ranking was based on this average F1 score. Our submitted system achieved an F1 score of 66.2%. The detailed results are shown in Table 1:
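For reference, an F1 score averaged over the three classes can be computed as sketched below. The sketch assumes an unweighted (macro) average of the per-class F1 scores; the function name `macro_f1` and the example labels are hypothetical, and this is not the organizers' evaluation script.

```python
def macro_f1(y_true, y_pred, labels=("positive", "negative", "neutral")):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A class that is never predicted correctly contributes an F1 of 0 to the average, so the metric penalizes systems that ignore one of the three classes.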
In the current work, we attempted to solve the problem of Sentiment Analysis of code-mixed English-Hindi data while participating in the SemEval shared task. Our system was based on traditional machine learning algorithms coupled with Grid Search Cross-Validation. When evaluated by the organizers, our system achieved an F1 score of 0.662. There was an option of developing an unconstrained system, but we used only the provided data to develop the system. As future work, we would like to enlarge this data and apply state-of-the-art neural network architectures to it, taking advantage of the concepts of matrix and embedded language, SentiWordNet, and other NLP features.
- Word difficulty prediction using convolutional neural networks. In TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), pp. 1109–1112.
- (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305.
- (2019) Sentence simplification using syntactic parse trees. In 2019 4th International Conference on Information Systems and Computer Networks (ISCON), pp. 672–676.
- (2020) Normalization of numeronyms using NLP techniques. In 2020 IEEE Calcutta Conference (CALCON), pp. 7–9.
- (2019) The Titans at SemEval-2019 Task 6: offensive language identification, categorization and target identification. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 759–762.
- (2019) The Titans at SemEval-2019 Task 5: detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 494–497.
- Humor analysis based on human annotation (HAHA)-2019: humor analysis at tweet level using deep learning. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), CEUR Workshop Proceedings, Bilbao, Spain (September 2019).
- (2020) Analyzing code-switching rules for English–Hindi code-mixed text. In Emerging Technology in Modelling and Graphics, J. K. Mandal and D. Bhattacharya (Eds.), Singapore, pp. 137–145.
- (2019) Code-mixed to monolingual translation framework. In Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE ’19, New York, NY, USA, pp. 30–35.
- (2018) Preparing Bengali-English code-mixed corpus for sentiment analysis of Indian languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), K. Shirai (Ed.), Paris, France.
- (2018) Sentiment analysis of code-mixed Indian languages: an overview of SAIL_Code-Mixed shared task @ ICON-2017. arXiv preprint arXiv:1803.06745.
- (2020) SemEval-2020 Sentimix Task 9: Overview of SENTIment Analysis of Code-MIXed Tweets. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain.
- (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2016) Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text. arXiv preprint arXiv:1611.00472.
- Automated recognition of epileptic EEG states using a combination of symlet wavelet processing, gradient boosting machine, and grid search optimizer. Sensors 19, pp. 219.