Research Article

Sentiment Analysis of Code-Mixed Text: A Review

Nurul Husna Mahadzir1, Mohd Faizal Omar2*, Mohd Nasrun Mohd Nawi3, Anas A. Salameh4, Kasmaruddin Che Hussin5

1 Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Kedah, 08400 Merbok, Kedah, Malaysia.

2 School of Quantitative Sciences, Universiti Utara Malaysia, 06010 Kedah, Malaysia.

3 School of Technology Management and Logistics, Universiti Utara Malaysia, 06010 Kedah, Malaysia.

4 College of Business Administration, Prince Sattam bin Abdulaziz University, 165 Al-Kharj 11942, Saudi Arabia.

5 Faculty of Entrepreneurship and Business, Universiti Malaysia Kelantan, Malaysia.

Corresponding Author’s email: [email protected]*

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021

Abstract: In recent times, sentiment analysis has become one of the most active and increasingly popular research areas in information retrieval and text mining. To date, sentiment analysis has been applied in various domains such as product, movie, sport and political reviews. Most previous work in this field has focused on analyzing a single language, especially English. However, with globalization and the increasing number of Internet users worldwide, it is common to see posts written in multiple languages. Moreover, in unstructured content such as Twitter posts, people tend to mix languages in one sentence, which makes sentiment analysis even more challenging. This paper reviews the state of the art of sentiment analysis for code-mixed text, including a detailed discussion of each focus area, a qualitative comparison and the limitations of current approaches. It also highlights challenges along this line of research and offers several recommendations for future work.

Keywords: Sentiment Analysis, code-mixed, language

1. Introduction

Sentiment Analysis (SA) has been considered one of the most active research areas in Natural Language Processing (NLP) since the early 2000s (B. Liu, et al., 2012). Its aim is to automatically detect emotions or opinions conveyed by a speaker or writer based on subjective information shared, especially on the Web (B. Liu, et al., 2012; B. Pang, et al., 2008; E. Cambria, et al., 2013). The importance of this field is shown by the high number of approaches and techniques proposed in research, as well as by the interest it has raised among organizations and companies in recent years. To date, SA has been applied to a wide variety of topics and issues such as online product reviews (e.g., movies, mobile phones) (G. Di Fabbrizio, et al., 2013), hotel reviews (M. Vela, et al., 2011), and political and financial analysis (K. Ahmad, et al., 2006; Y.E. Soelistio, et al., 2015).

Most previous work in this field has focused on analyzing a single language, especially English. However, with globalization and the increasing number of Internet users worldwide, it is common to see posts written in multiple languages, which makes SA even more challenging. Furthermore, in unstructured content such as Twitter posts, people tend to mix languages in one sentence. The practice of using more than one language in a single sentence has become widespread, yet such mixed language has rarely been a subject of SA. A different approach or technique is needed to cater for this kind of data, because information expressed in another language may be missed if the analysis covers only a single language (K. Dashtipour, et al., 2016).

Code-mixed language in daily conversation arises from the fact that some multilingual speakers feel more comfortable conveying messages in their native language than in English.

In the next section, research findings in SA for code-mixed text are discussed, followed by a qualitative comparison. The issues and challenges are then highlighted, and the conclusion and future work are presented in the last section.

2. Research Findings

Sentiment Analysis

There are two main tasks in SA: subjectivity analysis and sentiment classification. Subjectivity analysis deals with the detection of opinions or sentiments, while sentiment classification focuses on classifying those opinions into various polarities or rankings (A. Montoyo, et al., 2012). Some researchers have focused on classifying text as positive, negative or neutral (A. Pak, et al., 2010), while others consider finer levels of granularity such as highly positive, positive, neutral, negative or highly negative (S. Bhattacharjee, et al., 2015).

Figure 1. SA approach (W. Medhat, et al., 2014)

The approaches most broadly used to address the challenges of SA are the machine learning approach and the lexicon-based approach, as depicted in Figure 1. The machine learning approach uses supervised, semi-supervised or unsupervised learning to construct a model from a large training corpus (K. Ravi, et al., 2015). Among these, supervised learning techniques have been the most widely used in SA tasks. This method uses a training corpus to learn a classifier function. The effectiveness of SA systems using supervised algorithms depends on combining appropriate algorithms with a set of appropriate features. Commonly applied supervised sentiment classifiers include the Naïve Bayes (NB) classifier, Support Vector Machine (SVM) and Maximum Entropy (ME), which classify data into positive or negative categories (B. Liu, 2010).
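To make the supervised setting concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier over unigram features, written from scratch in Python. It is an illustration only and is not taken from any of the reviewed systems; the function names (`train_nb`, `predict_nb`) and the four training sentences are invented toy assumptions.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; real systems normalize much more.
    return text.lower().split()

def train_nb(examples):
    """Collect document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data mixing Malay and English words (assumption).
TOY_TRAIN = [
    ("buku ni brilliant", "positive"),
    ("tahniah champion", "positive"),
    ("jammed teruk stuck", "negative"),
    ("service teruk sangat", "negative"),
]
model = train_nb(TOY_TRAIN)
print(predict_nb(model, "buku brilliant"))
print(predict_nb(model, "teruk stuck"))
```

Because the classifier only counts tokens, it is indifferent to which language a unigram comes from, which is one reason such models can be trained directly on code-mixed data when labelled examples exist.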

On the other hand, the lexicon-based approach requires human annotation to manually construct a lexicon and is divided into dictionary-based and corpus-based approaches. The dictionary-based approach depends entirely on available resources such as WordNet to find opinion seed words, while the corpus-based approach is applied not only to obtain opinion seed words but also to find other opinion words using a large domain-specific corpus (R. Feldman, 2013). Specifically, this approach relies largely on lexical resources containing words and their associated sentiment (sentiment lexicons) to perform the classification. Well-known sentiment lexicons include SentiWordNet (K. Denecke, 2008; A. Kumar, 2014), WordNet (G. A. Miller, 1995) and SenticNet (E. Cambria, et al., 2014).
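The lexicon-based idea can be sketched in a few lines of Python: each token is looked up in a sentiment lexicon and the scores are summed. The lexicon below is a hypothetical toy with invented scores, not an excerpt from SentiWordNet or SenticNet, and `lexicon_polarity` is an illustrative name.

```python
import re

# Hypothetical bilingual toy lexicon; scores are invented for illustration.
TOY_LEXICON = {
    "brilliant": 1.0, "tahniah": 1.0, "good": 0.5,
    "teruk": -1.0, "stuck": -0.5, "bad": -0.5,
}

def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all tokens and map the total to a label."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_polarity("buku ni brilliant…everyone should read!!"))
print(lexicon_polarity("jammed teruk from Tapah to Ipoh, dah 2jam stuck kat sini…"))
```

A lexicon that covers only one of the mixed languages would miss cues such as "teruk", which is why lexicon coverage matters for code-mixed text.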

A large amount of previous research has addressed mining sentiment written in English. Although English remains the main language studied in this area, there are also efforts in other languages such as Japanese (H.T.T.F. Tadashi Kumano Hideki Kashioka, 2003; A. Danielewicz-Betz, et al., 2015), Chinese (H.Y. Lee, et al., 2011) and Malay (A. Alsaffar, et al., 2014). SA for a language usually relies on manually or semi-automatically constructed lexicons found in dictionaries or corpora (G. Dehong, et al., 2014; A.A. Ríos, et al., 2014; A.B. Muhammad, 2016). The availability of these resources enables the creation of rule-based SA or the construction of training data for classification purposes (D. Sitaram, et al., 2015).

Code-Mixed

Several terms are used interchangeably in the literature to refer to this concept, including mixed language, code-mixing, and code-switching. All these terms refer to the use of more than one language in the same conversational event, either in speaking or writing (R. Bhargava, et al., 2016; J. Gumperz, 1982). Throughout this paper, the term ‘code-mixed’ will be used to refer to this phenomenon.

The use of code-mixed language arises from the fact that some multilingual speakers or writers feel more comfortable conveying information in their native language than in English. Code-mixed language, whether spoken or written, is common in multilingual societies such as Malaysia and Singapore, and is frequently found in social media content on Facebook, Twitter and forums. In Malaysia, social media users tend to mix Malay and English, known as ‘Bahasa Rojak’, in their informal communication (K. Chuah, 2013). Below are examples of code-mixed posts on Twitter that contain both Malay and English text:

Example 1: buku ni brilliant…everyone should read!!

Example 2: tahniah Azizul…the Keirin World Champion!

Example 3: jammed teruk from Tapah to Ipoh, dah 2jam stuck kat sini…

The statements in the examples above mix two languages, Malay and English. Words in italic belong to the English language while the rest belong to the Malay language. Among the issues associated with code-mixed text are grammatical differences and irregular switching between languages within one sentence, which introduce new challenges for NLP. Therefore, different approaches and techniques are needed to achieve performance comparable to what has been achieved for a single language such as English.

Sentiment Analysis of Code-Mixed

Although a great deal of work has focused on analysing monolingual and multilingual data, some recent studies have analysed code-mixed content as well. A thorough literature search based on title, abstract and introduction was conducted through several scholarly search engines and online databases such as Scopus, Google Scholar, Springer, ACM, and IEEE. The keywords used to find the articles included ‘sentiment analysis mixed language’, ‘sentiment analysis code-mixed’ and ‘sentiment analysis code-switching’. Articles in conference proceedings as well as refereed journals that included these terms were considered. As a result, seventy papers published from 2008 were scanned, and forty papers related to SA of code-mixed content were identified and included in the analysis. Most efforts concentrated on five focus areas or specific tasks: i) pre-processing, ii) language identification, iii) lexicon creation, iv) sentiment classification and v) subjectivity analysis. Although this review is organized by focus area, some of the previous works address more than one focus area. Table 1 summarizes the SA publications for code-mixed text, their language pairs and research focus. Each research focus is reviewed in detail in the following sections.

Table 1. Research focus areas

Columns: Publication; Language Pairs; Pre-processing; Language Identification; Lexicon Creation; Sentiment Classification; Subjectivity Analysis

[31] Maltese – English √
[32] English – Spanish √
[33] English – Bengali √
[34] Urdu – English √ √
[35] Mandarin – English √
[36] Chinese – English √
[37] Malay – English √
[38] English – 30 non-English language √
[39] English – Arabic √ √
[40] Chinese – English √ √ √
[41] English – Spanish √
[42] Chinese – English √
[43] English – Hindi √
[44] English – Hindi – Bengali √
[45] English – Hindi √
[46] English – Hindi √ √
[47] English – Hindi √ √ √
[48] English – German √
[27] English – Hindi √
[49] English – Hindi √ √
[50] English – Hindi √
[51] English – Spanish √
[52] Chinese – English √
[53] English – Bangla √
[54] Chinese – English √
[55] English – Hindi √
[56] English – Bangla √
[28] English – Hindi √ √
[57] Singaporean English √
[58] English – Spanish √
[59] English – Hindi √
[60] English – Hindi √
[61] English – Manipuri √
[62] English – Bengali √ √
[63] Hindi – English – Bengali – Gujarati √
[64] English – Spanish √
[65] English – Portuguese √
[66] English – Chinese √

Pre-processing

Early work on this subject focused on pre-processing or normalization, which involves activities such as identification of noisy text, spelling correction and stop word removal (N. Samsudin, et al., 2013; Y. Vyas, 2014). Normalization of mixed English and Bangla text was studied by (Dutta et al. 2015), who focused on spelling correction using a noisy channel model. (Zhang, Chen & Huang 2014) introduced word translation and word categorization methods to normalize Chinese and English texts. For word translation, a neural network language model was used to translate in-vocabulary English words to Chinese, while for out-of-vocabulary words, a graph-based unsupervised model was applied to categorize them. (Sitaram et al., 2015) focused on normalizing four elements of code-mixed text in Indian social media, namely phonetic typing, abbreviations, wordplay and slang words, and achieved an accuracy of 85% with their model.
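As a rough illustration of noisy-channel spelling correction in general (not a reproduction of any reviewed system), the sketch below picks the candidate word w that maximizes P(w) · P(observed | w). The unigram counts, the two-valued error model and the names `edits1`, `channel_prob` and `correct` are all simplifying assumptions made for this example.

```python
import string
from collections import Counter

# Toy unigram counts standing in for the language model P(w); values are invented.
WORD_COUNTS = Counter({"stuck": 50, "stick": 20, "teruk": 30, "brilliant": 10, "jam": 25})
TOTAL = sum(WORD_COUNTS.values())

def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def channel_prob(observed, candidate):
    # Crude error model (assumption): typing a word correctly is far more
    # likely than producing it through a single edit.
    return 0.95 if observed == candidate else 0.05

def correct(observed):
    """Return the in-vocabulary candidate maximizing P(w) * P(observed | w)."""
    observed = observed.lower()
    candidates = {w for w in edits1(observed) if w in WORD_COUNTS}
    if observed in WORD_COUNTS:
        candidates.add(observed)
    if not candidates:
        return observed  # nothing plausible nearby; leave the token unchanged
    return max(candidates,
               key=lambda w: (WORD_COUNTS[w] / TOTAL) * channel_prob(observed, w))

print(correct("stuk"), correct("brillaint"), correct("stock"))
```

For code-mixed text the vocabulary would need to cover both languages, since a valid Malay word such as "teruk" would otherwise be treated as a misspelling.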

Language Identification

Identifying the language of each word is considered one of the most significant tasks in processing code-mixed content. The majority of previous works have used a word-level approach to identify languages (U. Barman, et al., 2014; A. Das, et al., 2014; S. Dutta, et al., 2015; P. Lamabam, et al., 2016). In other work, (Sharma et al. 2015) proposed a nearest neighbor approach for dealing with ambiguous words during the language identification phase on mixed English – Hindi text, and applied a lexicon-based approach to judge the sentiment of a statement.

(King, Abney, 2013) proposed a weakly supervised method to perform word-level language identification in multilingual documents, while (Barman et al. 2014) used a hybrid approach to perform language identification in three languages: Bengali, English, and Hindi. Their approach was able to classify ambiguous words because it takes contextual clues into consideration during classification. (Rudra et al., 2016) reported on language preference in tweets written by Indian users in Hindi and English, and argued that Hindi is preferred for expressing negative opinions and for swearing. (Nguyen, Dogruoz 2013) performed language identification on randomly selected posts in Turkish and Dutch from an online chat forum, using manually annotated data.
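The simplest form of word-level language identification is a dictionary lookup per token, sketched below in Python. The two word lists are tiny hypothetical stand-ins for real English and Malay dictionaries, and `tag_languages` is an illustrative name; the reviewed systems use richer features such as n-grams, SVMs or CRFs.

```python
import re

# Tiny hypothetical word lists (assumption); real systems use full dictionaries.
EN_WORDS = {"brilliant", "everyone", "should", "read", "from", "to", "stuck", "jammed", "fail"}
MS_WORDS = {"buku", "ni", "teruk", "dah", "kat", "sini", "tahniah", "fail"}

def tag_languages(text):
    """Tag each alphabetic token as 'en', 'ms', 'ambiguous' or 'unknown'."""
    tags = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        in_en, in_ms = tok in EN_WORDS, tok in MS_WORDS
        if in_en and in_ms:
            tag = "ambiguous"  # spelled identically in both languages
        elif in_en:
            tag = "en"
        elif in_ms:
            tag = "ms"
        else:
            tag = "unknown"    # e.g. named entities or out-of-vocabulary words
        tags.append((tok, tag))
    return tags

print(tag_languages("buku ni brilliant"))
```

A pure lookup cannot decide shared spellings or named entities, which is where the context-aware methods discussed above come in.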

Lexicon Creation

Another line of related research concerns the construction of sentiment lexicons. (Lee, Wang 2015) annotated and analyzed an English – Chinese lexicon with five basic emotions: happiness, sadness, fear, anger, and surprise. They used a Multiple Classifier System (MCS) for the classification. Along the same lines, (Li, Yu, Fung 2012) developed a Mandarin – English sentiment lexicon based on code-switching speech and text data, covering both intra-sentential and inter-sentential code-switching. For the text data, they developed an algorithm that automatically downloads code-switching data from Chinese-language news. (Vilares et al., 2017) proposed an English – Spanish corpus of tweets with code-switching. Each tweet was annotated according to SentiStrength criteria, and a trinary scale (positive, neutral and negative) was applied to classify polarity (M. Thelwall, et al., 2010).

Sentiment Classification / Polarity Detection

One of the major tasks in SA is sentiment classification or polarity detection. As mentioned previously, there are two main approaches to classifying sentiment: the machine learning approach and the lexicon-based approach. For code-mixed text, various methods within both approaches have been applied.

Machine Learning

Various machine learning techniques such as Naive Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (MaxEnt) have been applied to classify sentiment. (Narr et al. 2011) reported 71.5% accuracy on code-mixed text using an NB classifier on unigrams. Mukund & Srihari addressed issues and challenges related to Urdish blog data, which consists of Urdu mixed with English, and proposed a statistical Part-of-Speech (PoS) tagger and Structural Correspondence Learning (SCL) for the classification task. Sitaram et al. trained a classifier directly on mixed English – Hindi data rather than translating it to a single language; the technique was able to learn the grammatical transitions of both languages. Vilares et al. applied various machine learning approaches to classify polarity in three different environments. Raghavi et al. trained a basic SVM-based question classification system for English – Hindi data, with all data translated into English before feature selection and classification. In contrast, Yan et al. proposed a bilingual approach to process review comments written in Chinese and English; their models can analyse sentiment without translation and process the two languages simultaneously. Wang et al. predicted emotion using a joint factor graph model that considers both bilingual and emotional information, and their models significantly outperformed the baseline model with a p-value of less than 0.01.

Lexicon-based

Lo et al. constructed a toolkit to analyse polarity for Singlish (Singaporean English) using a semi-supervised approach. Unlike previous research, which relied on English knowledge bases such as SentiWordNet and WordNet, Lo et al. used SenticNet, which includes 30,000 common-sense concepts, together with handling of negation and adversative terms, as the core resource for polarity detection. Gajakosh et al. applied fuzzy sets to classify hotel reviews into five polarity categories.

Subjectivity Analysis

One effort in subjectivity analysis is found in [33], where the authors generated a Bengali subjectivity lexicon and proposed a Conditional Random Field (CRF) based approach as a subjectivity classifier for mixed English and Bengali text. (Abdul-Mageed et al. 2011) developed a system known as SAMAR (Subjectivity and Sentiment Analysis for Arabic Social Media), one focus of which is analysing subjective information in various dialects of Arabic.

Qualitative Comparison

Table 2 summarizes the techniques, datasets and limitations of previous research in this area.

Table 2. Comparison of techniques used, datasets and limitations

References | Focus Area | Technique | Domain/Dataset | Limitations
[42] | Normalization | Noisy channel approach, neural network. Classification: graph-based unsupervised method | Sina Weibo: 210 million posts | Lack of contextual information
[57] | Lexicon Creation, Sentiment Classification | Semi-supervised approach: SVM | Twitter | Concept ambiguity was removed manually
[49] | Normalization, Language Identification | Classification: lexicon-based approach, Hindi SentiWordNet, WordNet | Data from Forum for IR Evaluation 2013, 2014; 500 posts from Facebook and YouTube | Not able to deal with ambiguous words correctly
[27] | Sentiment Classification | Sentiment combination rules: recursive neural tensor network | Facebook and Google Plus: reviews on Virat Kohli. Training data: 345. Testing data: 97 | Does not cover ambiguous sentiment
[43] | Language Identification | Unsupervised dictionary-based approach, supervised classification (context-based SVM), Conditional Random Fields | Facebook group of Indian university students (2335 posts, 9813 comments) | The classifier without contextual clues did not perform well for the Hindi language
[38] | Language Identification | Weakly supervised method, n-grams, CRF model | 643 languages from four monolingual samples | Not able to handle named entities; incorrectly classifies shared words
[51] | Sentiment Classification | Supervised model based on bag-of-words | Twitter (3062 tweets) | The accuracy obtained on the code-switching corpus is lower than on the monolingual corpus
[28] | Language Identification, Sentiment Classification | L.I: machine learning. S.C: SentiWordNet. Final score: statistical technique | Training set: banglalyrics.net, Tamil lyrics, Indic websites | Ambiguous words manually removed
[61] | Language Identification | Trigram-based CRF model (performs better) | Twitter (700 code-mixed tweets), Facebook (300 code-mixed posts) | Data only tokenized; no pre-processing task has been adopted
[45] | Language Identification | Supervised: n-grams, dictionary-based, SVM | Facebook posts | Errors found in language boundary detection; evaluation on testing data does not perform well
[59] | Pre-processing, Sentiment Classification | CRF supervised approach | Shared task by 12th Conf on NLP | Cannot disambiguate similar tags

3. Open Issues and Challenges

The studies conducted so far on SA of code-mixed language raise several issues. This paper highlights three of them: language pairs, ambiguous words and subjectivity analysis.

Language Pairs

As shown in Figure 2, a large volume of published studies concentrates on mixes of English with Hindi (D. Sitaram, et al., 2015; Y. Vyas, et al., 2014; S. Sharma, et al., 2015; A. Jamatia, et al., 2015) or Chinese (J. Zhao, et al., 2012; Q. Zhang, et al., 2014; S. Lee, et al., 2015). Only a few studies have been carried out on pairs involving other languages such as German, Portuguese and Malay. As code-mixing is very common in many multicultural and multilingual countries, more research covering a wider range of languages is needed.

Figure 2. Language pairs.

Ambiguous Words

Another common challenge in SA of code-mixed text is ambiguity, which can arise in several situations. First, a word may be spelled the same in multiple languages but carry different meanings (S. Sharma, et al., 2015; S.K. Singh, et al., 2017). For example, the word “fail” exists in both Malay and English. In English, fail means to be unsuccessful, whereas in Malay it means a file or a folder. Second, a single word may carry multiple meanings within one language, such as the Malay word madu (honey from the bee / women sharing a husband). Word-by-word language identification, as practiced in most previous research, fails to identify such ambiguous words accurately because they are spelled identically. The surrounding words must be taken into consideration in order to obtain the sense and context needed to identify the word (A. Chanda, et al., 2016).
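One simple way to use surrounding words, sketched below, is to resolve a shared spelling such as "fail" by a majority vote over the language tags of its neighbours. This is only an illustration of the idea of contextual clues; the function name `resolve_ambiguous`, the tag names and the window size are assumptions, and the input is presumed to be already tagged at word level.

```python
def resolve_ambiguous(tagged, window=2):
    """Relabel 'ambiguous' tokens with the majority language among neighbours.

    tagged is a list of (token, tag) pairs with tags 'en', 'ms', 'ambiguous'
    or 'unknown'. Ties and contexts with no evidence stay 'ambiguous'.
    """
    resolved = []
    for i, (tok, tag) in enumerate(tagged):
        if tag == "ambiguous":
            lo = max(0, i - window)
            hi = min(len(tagged), i + window + 1)
            neighbours = [tagged[j][1] for j in range(lo, hi) if j != i]
            en_votes = neighbours.count("en")
            ms_votes = neighbours.count("ms")
            if en_votes > ms_votes:
                tag = "en"
            elif ms_votes > en_votes:
                tag = "ms"
        resolved.append((tok, tag))
    return resolved

# "dia simpan fail itu" (Malay context) versus "i fail the test" (English context).
malay_ctx = [("dia", "ms"), ("simpan", "ms"), ("fail", "ambiguous"), ("itu", "ms")]
english_ctx = [("i", "en"), ("fail", "ambiguous"), ("the", "en"), ("test", "en")]
print(resolve_ambiguous(malay_ctx))
print(resolve_ambiguous(english_ctx))
```

Voting of this kind addresses only the first type of ambiguity (shared spelling across languages); distinguishing the two senses of madu within Malay would require word sense disambiguation rather than language identification.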

Subjectivity Analysis

Little attention has been paid to the subjectivity analysis task. Subjectivity analysis is usually performed prior to detailed sentiment analysis and is considered an essential step for separating subjective from objective sentences. It is worth investigating whether its outcome can improve the sentiment classification performance of the system as a whole.
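The two-stage arrangement described here can be sketched as a small pipeline in which a subjectivity filter runs first and only subjective sentences reach the polarity step. The cue list, the polarity scores and the function names below are hypothetical toy assumptions chosen to keep the example self-contained.

```python
import re

# Hypothetical cue words and polarity scores (assumption, for illustration only).
SUBJECTIVE_CUES = {"brilliant", "teruk", "should", "best", "love", "hate"}
POLARITY = {"brilliant": 1, "best": 1, "love": 1, "teruk": -1, "hate": -1}

def is_subjective(tokens):
    """Stage 1: treat a sentence as subjective if it contains any cue word."""
    return any(tok in SUBJECTIVE_CUES for tok in tokens)

def analyse(text):
    """Stage 2 runs only for subjective input; objective text is filtered out."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not is_subjective(tokens):
        return "objective"
    score = sum(POLARITY.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(analyse("buku ni brilliant…everyone should read!!"))
print(analyse("from Tapah to Ipoh"))
```

Separating the stages makes it possible to measure how much the filter helps, for instance by comparing polarity accuracy with and without the first stage.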

4. Conclusion and Future Work

In this paper, a comprehensive literature search on state-of-the-art sentiment analysis of code-mixed text has been performed, illustrating the current trend of the domain. The literature review revealed that most research efforts in SA for code-mixed text have centered on the pre-processing, language identification, lexicon construction, and sentiment classification tasks.

In the future, it will be necessary to cater for other language pairs and to resolve ambiguity by going beyond word-level analysis in order to understand the context of a word and the sentiment it conveys.

5. Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

[Figure 2 chart values (%): English - Spanish 13.51; English - Hindi 40.54; English - Arabic 8.11; English - Chinese 27.03; English - Malay 5.41; English - Portuguese 2.70; English - German 2.70]

6. Funding Statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

1. B. Liu, Sentiment analysis and opinion mining, Synth. Lect. Hum. Lang. Technol. 5 (2012) 1–167.

2. B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, 2008. doi:10.1561/1500000011.

3. E. Cambria, B. Schuller, Y. Xia, C. Havasi, New Avenues in Opinion Mining and Sentiment Analysis, IEEE Intell. Syst. 28 (2013) 15–21. doi:10.1109/MIS.2013.30.

4. G. Di Fabbrizio, A.J. Stent, R. Gaizauskas, Summarizing Opinion-Related Information for Mobile Devices, in: Mob. Speech Adv. Nat. Lang. Solut., Springer New York, New York, NY, 2013: pp. 289– 317. doi:10.1007/978-1-4614-6018-3_11.

5. M. Vela, Sentiment Analysis for Hotel Reviews, Proc. Comput. Linguist. Conf. 231527 (2011) 45–52. doi:10.1051/matecconf/20167503002.

6. K. Ahmad, D. Cheng, Y. Almas, Multi-lingual Sentiment Analysis of Financial News Streams, Proc. 1st Intl Conf. Grid Technol. Financ. Model. Simul. (2006). doi:10.1109/IV.2005.143.

7. Y.E. Soelistio, M. Raditia, S. Surendra, Simple Text Mining for Sentiment Analysis of political figure using naïve bayes classifier method, (2015) 99–104.

8. K. Dashtipour, S. Poria, A. Hussain, E. Cambria, A.Y.A. Hawalah, A. Gelbukh, Q. Zhou, Multilingual Sentiment Analysis: State of the Art and Independent Comparison of Techniques, Cognit. Comput. 8 (2016) 757–771.

9. A. Montoyo, P. Martínez-Barco, A. Balahur, Subjectivity and sentiment analysis: An overview of the current state of the area and envisaged developments, Decis. Support Syst. 53 (2012) 675–679.

10. A. Pak, P. Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, LREc. 10 (2010).

11. S. Bhattacharjee, A. Das, U. Bhattacharya, S.K. Parui, S. Roy, Sentiment analysis using cosine similarity measure, in: 2015 IEEE 2nd Int. Conf. Recent Trends Inf. Syst., IEEE, 2015: pp. 27–32.

12. W. Medhat, A. Hassan, H. Korashy, Sentiment analysis algorithms and applications: A survey, Ain Shams Eng. J. 5 (2014) 1093–1113.

13. K. Ravi, V. Ravi, A survey on opinion mining and sentiment analysis: Tasks, approaches and applications, 2015.

14. B. Liu, Sentiment Analysis and Subjectivity, in: Handb. Nat. Lang. Process., 2010: pp. 1–38.

15. R. Feldman, Techniques and applications for sentiment analysis, Commun. ACM. 56 (2013) 82.

16. K. Denecke, Using SentiWordNet for multilingual sentiment analysis, in: Proc. - Int. Conf. Data Eng., 2008: pp. 507–512.

17. A. Kumar, A. R, Sentiment Analysis Using Sentiwordnet And Semantic Approach, Int. J. Adv. Inf. Arts Sci. Manag. ISSN. 1 (2014).

18. G. a. Miller, WordNet: a lexical database for English, Commun. ACM. 38 (1995) 39–41.

19. E. Cambria, D. Olsher, D. Rajagopal, SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis, in: Twenty-Eighth AAAI Conf., 2014: pp. 1515–1521.

20. H.T.T.F. Tadashi Kumano Hideki Kashioka, Construction and analysis of Japanese-English broadcast news corpus with named entity tags, ACL2003

21. A. Danielewicz-Betz, H. Kaneda, M. Mozgovoy, M. Purgina, Creating English and Japanese Twitter Corpora for Emotion Analysis, People. 1634 (2015) 5869.

22. H.Y. Lee, H. Renganathan, Chinese Sentiment Analysis Using Maximum Entropy, (2011) 89–93.

23. A. Alsaffar, N. Omar, Study on feature selection and machine learning algorithms for Malay sentiment classification, in: Proc. 6th Int. Conf. Inf. Technol. Multimed., 2014: pp. 270–275.

24. G. Dehong, Cross-Lingual Sentiment Lexicon Learning, 2014.

25. A.A. Ríos, P.J. Amarilla, G.A.G. Lugo, Sentiment categorization on a creole language with lexicon-based and machine learning techniques, Proc. - 2014 Brazilian Conf. Intell. Syst. BRACIS 2014. (2014) 37–43.

26. A.B. Muhammad, Contextual Lexicon-based Sentiment Analysis for Social Media, 2016.

27. D. Sitaram, S. Murthy, D. Ray, D. Sharma, K. Dhar, Sentiment analysis of mixed language employing Hindi-English code switching, Mach. Learn. Cybern. (ICMLC), 2015 Int. Conf. 1 (2015) 271–276.

28. R. Bhargava, Y. Sharma, S. Sharma, Sentiment Analysis for Mixed Script Indic Sentences, in: Adv. Comput. Commun. Informatics (ICACCI), 2016 Int. Conf. IEEE, 2016: pp. 524–529. doi:10.1109/ICACCI.2016.7732099.

30. K. Chuah, Aplikasi Media Sosial Dalam Pembelajaran Bahasa Inggeris : Persepsi Pelajar Universiti, Issues Lang. Stud. 2 (2013) 56–63.

31. P.-J. Farrugia, TTS Pre-processing Issues for Mixed Language Support, in: Proc. CSAW’04, 2004: pp. 36–41.

32. T. Solorio, Y. Liu, Part-of-Speech Tagging for English-Spanish Code-Switched Text, in: Proc. Conf. Empir. Methods Nat. Lang. Process., 2008: pp. 1051–1060.

33. A. Das, S. Bandyopadhyay, Subjectivity Detection in English and Bengali: A CRF-based Approach, in: Proceeding ICON 2009, 2009.

34. S. Mukund, R. Srihari, Analyzing Urdu social media for sentiments using transfer learning with controlled translations, in: Proc. Second Work. Lang. Soc. Media, 2012: pp. 1–8.

35. Y. Li, Y. Yu, P. Fung, A Mandarin-English Code-Switching Corpus, in: Lr. 2012 - Eighth Int. Conf. Lang. Resour. Eval., 2012: pp. 2515–2519.

36. J. Zhao, X. Qiu, S. Zhang, F. Ji, X. Huang, Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features, in: Proc. 2012 Jt. Conf. Empir. Methods Nat. Lang. Process. Comput. Nat. Lang. Learn., 2012: pp. 1379–1388.

37. N. Samsudin, A.R. Hamda, M. Puteh, M.Z.A. Nazri, Mining Opinion in Online Messages, Int. J. Adv. Comput. Sci. Appl. 4 (2013) 19–24. http://ijacsa.thesai.org/.

38. B. King, S. Abney, Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods, in: HLT-NAACL, 2013: pp. 1110–1119.

39. M. Abdul-Mageed, M. Diab, S. Kübler, SAMAR: Subjectivity and sentiment analysis for Arabic social media, Comput. Speech Lang. 28 (2014) 20–37.

40. G. Yan, W. He, J. Shen, C. Tang, A bilingual approach for conducting Chinese and English social media sentiment analysis, Comput. Networks. 75 (2014) 491–503. doi:10.1016/j.comnet.2014.08.021.

41. S. Vicente, R. Agerri, G. Rigau, Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages, in: EACL, 2014: pp. 88–97.

42. Q. Zhang, H. Chen, X. Huang, Chinese-English Mixed Text Normalization, in: Proc. 7th ACM Int. Conf. Web Search Data Min., 2014: pp. 433–442. doi:10.1145/2556195.2556228.

43. U. Barman, A. Das, J. Wagner, J. Foster, Code Mixing: A Challenge for Language Identification in the Language of Social Media, in: EMNLP 2014, 2014: p. 13.

44. U. Barman, J. Wagner, G. Chrupała, J. Foster, DCU-UVT: Word-Level Language Classification with Code-Mixed Data, in: Proc. 2014 Conf. Empir. Methods Nat. Lang. Process., 2014: pp. 127–132.

45. A. Das, B. Gamback, Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text, in: Proc. 11th Int. Conf. Nat. Lang. Process. Goa, India, 2014: pp. 169–178.

46. Y. Vyas, S. Gella, J. Sharma, K. Bali, M. Choudhury, POS Tagging of English-Hindi Code-Mixed Social Media Content, in: Proc. Conf. Empir. Methods Nat. Lang. Process., 2014: pp. 974–979.

47. S. Sharma, P.Y.K.L. Srinivas, R.C. Balabantaray, Sentiment analysis of code - Mix script, in: 2015 Int. Conf. Comput. Netw. Commun. CoCoNet 2015, 2015: pp. 530–534. doi:10.1109/CoCoNet.2015.7411238.

48. E. Gredel, Metaphorical patterns and the subprime mortgage crisis: Towards cross-linguistic, discourse-specific and n-gram-based dictionaries for sentiment analysis, Stud. Commun. Sci. 15 (2015) 37–44. doi:10.1016/j.scoms.2015.03.003.

49. S. Sharma, P.Y.K.L. Srinivas, R.C. Balabantaray, Text normalization of code mix and sentiment analysis, in: 2015 Int. Conf. Adv. Comput. Commun. Informatics, ICACCI 2015, 2015: pp. 1468–1473. doi:10.1109/ICACCI.2015.7275819.

50. A. Jamatia, B. Gamback, A. Das, Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages, in: Proc. Recent Adv. Nat. Lang. Process., 2015: pp. 239–248. doi:10.13140/RG.2.1.1222.0640.

51. D. Vilares, M.A. Alonso, C. Gómez-Rodriguez, Sentiment Analysis on Monolingual, Multilingual and Code-Switching Twitter Corpora, in: Proc. 6th Work. Comput. Approaches To Subj. Sentim. Soc. Media Anal., 2015: pp. 2–8.

52. S. Lee, Z. Wang, Emotion in Code-switching Texts: Corpus Construction and Analysis, in: ACL-IJCNLP 2015, 2015: pp. 91–99.

53. S. Dutta, T. Saha, S. Banerjee, S.K. Naskar, Text normalization in code-mixed social media text, in: 2015 IEEE 2nd Int. Conf. Recent Trends Inf. Syst. ReTIS 2015 - Proc., 2015: pp. 378–382. doi:10.1109/ReTIS.2015.7232908.

54. Z. Wang, S.Y.M. Lee, S. Li, G. Zhou, Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information, in: Proc. 53rd Annu. Meet. Assoc. Comput. Linguist. 7th Int. Jt. Conf. Nat. Lang. Process. (Volume 2: Short Pap.), 2015: pp. 763–768.

55. K.C. Raghavi, M. Chinnakotla, M. Shrivastava, "Answer ka type kya he?" Learning to Classify Questions in Code-Mixed Language, in: Proc. 24th Int. Conf. World Wide Web, 2015: pp. 853–858. doi:10.1145/2740908.2743006.

56. S. Banerjee, A. Kuila, A. Roy, S.K. Naskar, P. Rosso, S. Bandyopadhyay, A Hybrid Approach for Transliterated Word-Level Language Identification: CRF with Post-Processing Heuristics, in: Proc. Forum Inf. Retr. Eval. - FIRE ’14, ACM Press, New York, New York, USA, 2015: pp. 54–59. doi:10.1145/2824864.2824876.

57. S.L. Lo, E. Cambria, R. Chiong, D. Cornforth, A multilingual semi-supervised approach in deriving Singlish sentic patterns for polarity detection, Knowledge-Based Syst. 105 (2016) 236–247. doi:10.1016/j.knosys.2016.04.024.

58. D. Vilares, C. Gómez-Rodríguez, M.A. Alonso, EN-ES-CS: An English-Spanish Code-Switching Twitter Corpus for Multilingual Sentiment Analysis, in: Proc. Tenth Int. Conf. Lang. Resour. Eval., 2016: pp. 4149–4153.

59. S. Ghosh, S. Ghosh, D. Das, Part-of-speech Tagging of Code-Mixed Social Media Text, in: EMNLP 2016, 2016: pp. 90–97.

60. K. Rudra, S. Rijhwani, R. Begum, K. Bali, M. Choudhury, N. Ganguly, Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?, in: Proc. 2016 Conf. Empir. Methods Nat. Lang. Process., 2016: pp. 1131–1141.

61. P. Lamabam, K. Chakma, A Language Identification System for Code-Mixed English-Manipuri Social Media Text, in: 2nd IEEE Int. Conf. Eng. Technol. (ICETECH), 17th & 18th March 2016, Coimbatore, TN, India, 2016: pp. 79–83.

62. A. Chanda, D. Das, C. Mazumdar, Unraveling the English-Bengali Code-Mixing Phenomenon, in: Proc. Second Work. Comput. Approaches to Code Switch., 2016: pp. 80–89.

63. N. Bjørner, S. Prasad, L. Parida, Language Identification and Disambiguation in Indian Mixed-Script, in: Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 2016: pp. 113–121. doi:10.1007/978-3-319-28034-9.

64. D. Vilares, M.A. Alonso, C. Gómez-Rodríguez, Supervised sentiment analysis in multilingual environments, Inf. Process. Manag. 53 (2017) 595–607. doi:10.1016/j.ipm.2017.01.004.

65. K. Becker, V.P. Moreira, A.G.L. dos Santos, Multilingual emotion classification using supervised learning: Comparative experiments, Inf. Process. Manag. 53 (2017) 684–704. doi:10.1016/j.ipm.2016.12.008.

66. Z. Wang, S. Lee, S. Li, G. Zhou, Emotion Analysis in Code-Switching Text with Joint Factor Graph Model, IEEE/ACM Trans. Audio, Speech, Lang. Process. 25 (2017) 469–480. doi:10.1109/TASLP.2016.2637280.

67. B. King, S. Abney, Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods, in: HLT-NAACL, 2013: pp. 1110–1119.

68. D.-P. Nguyen, A.S. Dogruoz, Word level language identification in online multilingual communication, Assoc. Comput. Linguist. (2013).

69. M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, A. Kappas, Sentiment strength detection in short informal text, J. Am. Soc. Inf. Sci. Technol. 61 (2010) 2544–2558. doi:10.1002/asi.21416.

70. S. Narr, M. Ulfenhaus, S. Albayrak, Language-Independent Twitter Sentiment Analysis, Knowl. Discov. Mach. Learn. (2012) 12–14.

71. M. Abdul-Mageed, M.T. Diab, Subjectivity and Sentiment Annotation of Modern Standard Arabic Newswire, in: Proc. 49th Annu. Meet. Assoc. Comput. Linguist. Hum. Lang. Technol. Short Pap., 2011: pp. 110–118.

72. S. Carter, W. Weerkamp, M. Tsagkias, Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text, Lang. Resour. Eval. 47 (2013).

73. D. Jurgens, S. Dimitrov, D. Ruths, EMNLP 2014 First Workshop on Computational Approaches to Code Switching Proceedings of the Workshop, in: EMNLP 2014, 2014: pp. 51–61.

74. S.K. Singh, K.S. Manoj, Importance and Challenges of Social Media Text, Int. J. Adv. Res. Comput. Sci. 8 (2017) 2015–2018.