sentiment-annotated lexicon construction for an urdu ... - Paas.com.pk

Pakistan Journal of Science (Vol. 63 No. 4 Dec, 2011) 

SENTIMENT-ANNOTATED LEXICON CONSTRUCTION FOR AN URDU TEXT BASED SENTIMENT ANALYZER

Afraz Z. S., A. Muhammad and Martinez-Enriquez A. M * 

Department of CS & E, U. E. T., Lahore, Pakistan 

** Department of CS, CINVESTAV-IPN, D.F. Mexico

Corresponding author’s email (afrazsyed@uet.edu.pk) 

ABSTRACT: A lexicon-based sentiment analyzer is composed of two parts: a classifier and a
lexicon of sentiment-annotated words and phrases. This paper presents a model for such a
lexicon, in which every subjective entry is annotated with a polarity score. The approach
handles Urdu words, which are morphologically rich and therefore make the lexicon considerably
more intricate than lexicons for languages such as English. This is a pioneering effort, as no
sentiment-annotated lexicon exists for the Urdu language. Moreover, lexicons already developed
for other languages cannot be reused, because Urdu exhibits distinctive orthographic,
morphological, and grammatical features. The lexicon is constructed as part of a lexicon-based
sentiment analyzer for opinionated Urdu text given in the form of reviews. When the lexicon was
applied to multiple reviews, the results met expectations.

Key words: Natural language processing, computational linguistics, sentiment analysis, opinion
mining, shallow parsing, Urdu text processing, lexicon construction.

INTRODUCTION 

The rapid proliferation of user-generated text on the internet has given rise to a number of
previously unknown aspects of natural language processing and understanding. Such a huge body
of knowledge, generated by millions of minds around the world, cannot be left unexamined
(Glaser et al., 2002). As a result, the field of sentiment analysis, also called opinion mining
or subjectivity analysis, is emerging rapidly as a new research frontier. For the English
language, this area has been under study for the last decade (Hatzivassiloglou and Wiebe, 2000;
Turney, 2002; Yu and Hatzivassiloglou, 2003; Pang and Lee, 2008). These contributions present
complete models of sentiment analyzers based on different techniques, such as supervised or
unsupervised machine learning and lexicon-based approaches.

In these works, a typical sentiment analyzer incorporates two components: (a) the classifier,
which analyzes and categorizes the given text, and (b) one or more lexicons containing the
orientation of each entry (word or phrase) as positive or negative. These lexicons are called
sentiment-annotated lexicons (Pang and Lee, 2008), because the polarity marks that indicate
orientation are annotated directly on the lexicon entries. Such lexicons can be either manually
compiled or automatically generated. A considerable body of research on sentiment-annotated
lexicon construction has emerged within a few years (Annett and Kondrak, 2008; Higashinaka et
al., 2007; Andreevskaia and Bergler, 2006; Hu and Liu, 2005; Yu and Hatzivassiloglou, 2003;
Riloff et al., 2003; Turney, 2002; Hatzivassiloglou and Wiebe, 2000). These contributions
propose a variety of approaches to lexicon development, lexicon structure, and the
relationships between entries.

These efforts are mainly for the English language and exploit pre-developed linguistic
resources, such as corpora, for the development and extraction of the required lexicons.
Consequently, for English this aspect of research is no longer an unsolved issue. Urdu, on the
other hand, is a resource-poor language (Mukund et al., 2010), and hence the task of
constructing a domain-specific sentiment-annotated lexicon for Urdu text poses many challenges.
To our knowledge, no such lexicon exists. However, a few efforts have constructed lexicons for
other language processing applications of Urdu text (Ijaz and Hussain, 2007; Humayoun et al.,
2007; Muaz et al., 2009; Mukund et al., 2010).

Therefore, this paper describes the structure, construction and evaluation of a manually tagged
sentiment-annotated lexicon of Urdu words, built as a component of a sentiment analysis model
for Urdu text. The lexicon contains information about the subjectivity of each entry in
addition to its orthographic, phonological, syntactic, and morphological aspects. The approach
characterizes the subjective entries in the lexicon through two attributes: orientation (either
positive or negative) and intensity (the force of the orientation). After its development, the
lexicon is integrated with the sentiment classifier. The classifier preprocesses the given text
and then applies shallow-parsing-based chunking. It looks up all the words and phrases present
in the text in the lexicon, so that all the subjective terms in the text become annotated. On
the basis of the polarities of individual words, the polarity of each sentence, and then of the
whole review, is calculated. The overall system performance is evaluated on a corpus of movie
reviews in Urdu. The classification algorithm is applied to the review corpus, and each
subjective word in a review is compared with the lexicon entries to compute the polarity
scores.

MATERIAL AND METHODS 

This section describes the construction, structure and integration of the sentiment-annotated
lexicon of Urdu words developed for a sentiment classification model. The model is designed to
distinguish between the objective and subjective terms in a given review. Objective terms carry
neutral sentiment and have no effect on the final classification decision, whereas subjective
terms are the carriers of sentiment, and their presence can alter the final classification. In
keeping with this distinction, the lexicon entries are also categorized as objective or
subjective terms. Before going into details, some terms are defined below:

• Orientation. Orientation describes either the positivity or the negativity of a lexicon
entry. For most entries, orientation is predefined during the lexicon construction phase. In a
given text, however, it can be altered by the use of a polarity shifter in the sentence. For
example, the word “اچھا” (acha, good) has a positive orientation, but with the polarity shifter
“نہیں” (naheen, not) it becomes a negative expression, i.e., “اچھا نہیں” (acha naheen, not
good). Moreover, the orientation of some words (though they are few in number) is highly domain
specific or depends upon the context within which they are used. These two issues are beyond
the scope of this research.

• Intensity. Intensity is the force of the positivity or negativity of a lexicon entry.
Usually, modifiers such as “بہت” (bohat, very) describe the intensity of an expression. As in
other languages, there are three degrees of intensity in Urdu: absolute (only a positive or
negative orientation), comparative (two distinct entities are compared with each other) and
superlative (one of all the entities has the highest orientation).

• Polarity. A polarity mark is annotated on each lexicon entry to show its orientation and
intensity. This is done at the implementation level.
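
The three definitions above can be illustrated with a minimal sketch. It assumes that a polarity mark is a signed number whose sign is the orientation and whose magnitude is the intensity; the paper does not state its numeric scale, so the weights in `Degree` and the names `polarity_mark` and `shift` are hypothetical.

```python
from enum import Enum


class Degree(Enum):
    # Hypothetical weights: the paper names the degrees but gives no numeric scale.
    ABSOLUTE = 1.0
    COMPARATIVE = 2.0
    SUPERLATIVE = 3.0


POSITIVE, NEGATIVE = +1, -1


def polarity_mark(orientation: int, degree: Degree = Degree.ABSOLUTE) -> float:
    """Combine orientation (sign) and intensity (magnitude) into one signed score."""
    return orientation * degree.value


def shift(mark: float) -> float:
    """Apply a polarity shifter such as naheen (not): flip the sign, keep the magnitude."""
    return -mark


acha = polarity_mark(POSITIVE)   # acha (good): positive, absolute degree
acha_naheen = shift(acha)        # acha naheen (not good)
print(acha, acha_naheen)         # prints: 1.0 -1.0
```

Keeping the shifter as a separate function reflects the point made in the Orientation definition: the lexicon stores the predefined orientation, and the change happens only when the word is met in a sentence.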

Lexicon Construction: A sentiment-annotated lexicon is more intricate than other Natural
Language Processing (NLP) lexicons. There are two reasons for this intricacy:

• Each lexicon entry carries polarity information in addition to its orthographic,
phonological, syntactic, and morphological features. This polarity information is usually
represented as positive, negative, or neutral. For example, SentiWordNet (Andreevskaia and
Bergler, 2006) uses triplets [positive, negative, objective], with each value between a minimum
of 0.0 and a maximum of 1.0.

• Many words exhibit multiple orientations depending upon their use and domain. For example, in
the sentence “This damage is everlasting”, everlasting is a positive word, but the overall
orientation of the comment is negative. Similarly, unpredictable is a positive word when used
about a movie’s plot, but becomes negative when used about the performance of a microwave oven.
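
The second reason can be made concrete with a small sketch of a domain-aware lookup. This is only an illustration of the problem, not the paper's method (domain dependence is declared out of scope above); the table names, the domain labels and the values are all assumptions.

```python
from typing import Optional

# Hypothetical tables: values are illustrative, not taken from the paper's lexicon.
DEFAULT_ORIENTATION = {"everlasting": +1, "unpredictable": +1}
DOMAIN_ORIENTATION = {
    ("unpredictable", "movies"): +1,      # an unpredictable plot is praise
    ("unpredictable", "appliances"): -1,  # an unpredictable oven is a complaint
}


def orientation(word: str, domain: Optional[str] = None) -> int:
    """Return +1, -1, or 0 (objective or unknown), preferring a domain-specific value."""
    if (word, domain) in DOMAIN_ORIENTATION:
        return DOMAIN_ORIENTATION[(word, domain)]
    return DEFAULT_ORIENTATION.get(word, 0)


print(orientation("unpredictable", "movies"))      # prints: 1
print(orientation("unpredictable", "appliances"))  # prints: -1
```

A lookup of this kind handles the domain case only; the “This damage is everlasting” case depends on the surrounding sentence and would need context-level rules.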

Figure 1. Structure of the sentiment-annotated lexicon with respect to O and I

Construction Steps: The lexicon construction task is divided into the following steps:

• Categorize the words as either subjective or objective. When the classification algorithm is
applied, the classifier simply ignores objective terms, so its performance depends entirely
upon the subjective words.

• Categorize these words according to morphological rules, which work at the word level. These
rules can change the structure, meaning, and part of speech of a word; for example, rules for
marking an adjective with the noun it qualifies.

• Identify the grammatical rules, which describe the possible structures of a sentence and the
positions of the parts of speech with respect to each other; for example, the use of modifiers
with adjectives or of auxiliaries with verbs. As Urdu is a free word order language, these
rules are more difficult to define and implement.

• Discover relationships between different lexicon entries. These relationships can define
synonyms, antonyms, cross references, etc.

• Decide and annotate the polarities and then the intensities of the entries. In this task, the
entries are first categorized as positive or negative, and then their intensity scores are
attached to them. Some entries have only an orientation, some have only an intensity (like
modifiers), and some have both values.
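
The steps above suggest a record per entry that carries its part of speech, its subjectivity flag, its relations and its annotation. The sketch below is one possible shape for such a record, together with a sanity check that a manual annotator might run: synonyms should share an orientation and antonyms should have opposite ones. The `Entry` fields, the `relation_conflicts` helper and the word nakaam (unsuccessful) are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    key: str                             # transliteration, used as the lookup key
    pos: str                             # part of speech, e.g. "ADJ"
    subjective: bool                     # subjective vs objective
    orientation: Optional[int] = None    # +1 or -1 once annotated
    synonyms: set[str] = field(default_factory=set)
    antonyms: set[str] = field(default_factory=set)


def relation_conflicts(lexicon: dict[str, Entry]) -> list[tuple[str, str, str]]:
    """List (word, word, relation) triples whose orientations contradict the relation."""
    conflicts = set()
    for entry in lexicon.values():
        for relation, keys, should_match in (
            ("synonym", entry.synonyms, True),
            ("antonym", entry.antonyms, False),
        ):
            for key in keys:
                other = lexicon.get(key)
                if other is None or entry.orientation is None or other.orientation is None:
                    continue  # nothing to compare yet
                if (entry.orientation == other.orientation) != should_match:
                    first, second = sorted((entry.key, other.key))
                    conflicts.add((first, second, relation))
    return sorted(conflicts)


lexicon = {
    "kamyaab": Entry("kamyaab", "ADJ", True, +1, antonyms={"nakaam"}),
    "nakaam": Entry("nakaam", "ADJ", True, -1, antonyms={"kamyaab"}),
}
print(relation_conflicts(lexicon))  # prints: []
```

A check like this is cheap to run after every annotation pass and catches the most likely manual tagging slip, a pair of antonyms given the same sign.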

Lexicon Structure: It is assumed that the lexicon entries are either subjective or objective.
Objective terms are saved without any polarity mark, whereas subjective terms are further
categorized, on the basis of orientation and intensity, into three types:

• Terms with orientation only, T(O). These terms are either absolutely positive or absolutely
negative. No degree of positivity or negativity is attached to them.

• Terms with intensity only, T(I). These terms have no orientation of their own, but they can
intensify the orientation of other words in the sentence.

• Terms with both orientation and intensity, T(O, I). A term that carries both an orientation
(either positive or negative) and an intensity lies in this category and is marked with both
values.

Some examples of lexicon entries from all three categories, i.e., T(O), T(I) and T(O, I), are
given in Table 1. For example, the word “کامیاب” (kamyaab, successful) has a positive
orientation but no intensity. Similarly, “زیادہ” (zyada, more) and “بہت” (bohat, very) both
have intensity and no orientation, whereas “بہتر” (behtar, better) and “بہترین” (behtareen,
best) both have a positive orientation, with intensities of the comparative and superlative
degrees, respectively.
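
The three categories follow directly from which of the two attributes an entry carries. The sketch below encodes the five example words from the text and derives the category of each; the `Term` record and the intensity labels are assumptions made for illustration, since the paper does not give its storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    word: str                            # Urdu form
    gloss: str                           # transliteration and English gloss
    orientation: Optional[int] = None    # +1, -1, or None when absent
    intensity: Optional[str] = None      # hypothetical label, None when absent


def category(term: Term) -> str:
    """Map an entry to T(O), T(I), T(O, I), or objective."""
    if term.orientation is not None and term.intensity is not None:
        return "T(O, I)"
    if term.orientation is not None:
        return "T(O)"
    if term.intensity is not None:
        return "T(I)"
    return "objective"


LEXICON = [
    Term("کامیاب", "kamyaab, successful", orientation=+1),
    Term("زیادہ", "zyada, more", intensity="intensifier"),
    Term("بہت", "bohat, very", intensity="intensifier"),
    Term("بہتر", "behtar, better", orientation=+1, intensity="comparative"),
    Term("بہترین", "behtareen, best", orientation=+1, intensity="superlative"),
]

for term in LEXICON:
    print(term.gloss, "->", category(term))
```

Deriving the category from the stored attributes, rather than storing it as a third field, keeps the two from drifting apart when an entry is re-annotated.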

Figure 2. Integration of sentiment annotated lexicon of Urdu words with the sentiment classifier 

System Integration: The annotated lexicon of Urdu words is integrated with the sentiment
classifier as shown in Figure 2. First, the given text, in the form of a review, is taken from
the website. The sentiment classifier component of the system preprocesses this review and
segments it into sentences and then into words. These words are then tagged with their
respective parts of speech. Next, the tagged words are compared with the lexicon entries for
sentiment orientations and intensities. This comparison results in polarity-annotated words and
phrases. The classifier then calculates the sentiment orientation of each sentence using the
term polarities.
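
The flow described above (segment, look up, combine) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' classifier: it skips part-of-speech tagging and shallow parsing, tokenizes on whitespace, assumes an intensifier precedes and a shifter follows the term it modifies, and uses hypothetical numeric scores. The filler words in the sample review are illustrative and are not from the paper.

```python
import re

# Urdu forms taken from the examples in the text.
ACHA, KAMYAAB = "اچھا", "کامیاب"   # acha (good), kamyaab (successful)
BOHAT, ZYADA = "بہت", "زیادہ"       # bohat (very), zyada (more)
NAHEEN = "نہیں"                     # naheen (not)

# Hypothetical scores: the numeric scale is an assumption of this sketch.
ORIENTATION = {ACHA: +1.0, KAMYAAB: +1.0}   # terms that carry an orientation
INTENSIFIERS = {BOHAT: 2.0, ZYADA: 2.0}     # T(I) terms scale the next subjective term
SHIFTERS = {NAHEEN}                         # polarity shifters


def split_sentences(review: str) -> list[str]:
    """Segment on the Urdu full stop (U+06D4) and other common terminators."""
    return [s.strip() for s in re.split(r"[۔.!?؟]", review) if s.strip()]


def sentence_polarity(sentence: str) -> float:
    scores = []        # one score per subjective term found
    multiplier = 1.0   # pending intensifier, applied to the next subjective term
    for token in sentence.split():
        if token in INTENSIFIERS:
            multiplier *= INTENSIFIERS[token]
        elif token in ORIENTATION:
            scores.append(ORIENTATION[token] * multiplier)
            multiplier = 1.0
        elif token in SHIFTERS and scores:
            scores[-1] = -scores[-1]   # the shifter negates the preceding term
        # every other token is treated as objective and ignored
    return sum(scores)


def review_polarity(review: str) -> str:
    total = sum(sentence_polarity(s) for s in split_sentences(review))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


first = " ".join(["اداکار", BOHAT, ACHA, "ہے"])   # "the actor is very good"
second = " ".join(["گانا", ACHA, NAHEEN, "ہے"])    # "the song is not good"
review = first + "۔ " + second + "۔"
print(sentence_polarity(first), sentence_polarity(second), review_polarity(review))
```

In this sample the first sentence scores 2.0 and the second scores -1.0, so the review as a whole is labeled positive. Summing sentence scores is only one possible aggregation; the paper does not specify how sentence polarities are combined into a review polarity.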

RESULTS AND DISCUSSION 

As already mentioned, corpora of reviews in Urdu are not available in electronic form. Although
some other corpora, of news and blogs, are accessible, they are not appropriate for the
experimentation and evaluation of our system because they do not contain opinionated text like
reviews.

Therefore, two corpora were manually collected as test beds, from the domains of movies and
electronic appliances. The reviews were taken from different people to avoid monotonous
opinions. The movie review corpus, MR, comprises 450 reviews in total, 226 positive and 224
negative. The product review corpus, PR, contains 328 reviews of electronic appliances, 177
positive and 151 negative.

Accuracy is used as the system performance metric. It measures how closely the document
classification suggested by our system matches the actual sentiment present in the review. A
series of experiments was performed on both corpora, one after another.

Table 2 shows the results, with an accuracy of 66-74% for MR and 77-79% for PR. It also gives
the classification accuracy for positive and negative reviews separately.

Table 2. Results of experimentation on both corpora

Category    Corpus    Accuracy
Negative    MR        66%
Negative    PR        77%
Positive    MR        74%
Positive    PR        79%
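
The per-category figures in Table 2 can be combined with the corpus sizes reported above to estimate a single accuracy per corpus. The sketch below assumes that each table entry is the fraction of reviews of that category classified correctly; since the published percentages are rounded, the result is an approximation rather than a reported figure.

```python
# Review counts and per-category accuracies as reported in the text and in Table 2.
COUNTS = {
    "MR": {"positive": 226, "negative": 224},
    "PR": {"positive": 177, "negative": 151},
}
ACCURACY = {
    "MR": {"positive": 0.74, "negative": 0.66},
    "PR": {"positive": 0.79, "negative": 0.77},
}


def overall_accuracy(corpus: str) -> float:
    """Weight each category's accuracy by its number of reviews."""
    counts = COUNTS[corpus]
    correct = sum(counts[c] * ACCURACY[corpus][c] for c in counts)
    return correct / sum(counts.values())


for corpus in COUNTS:
    print(f"{corpus}: {overall_accuracy(corpus):.1%}")
```

Under that assumption the estimate is roughly 70% for MR and 78% for PR. Both corpora are nearly balanced, so the weighted figure sits close to the simple mean of the two category accuracies.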

Conclusions: This research work presents the structure, development and integration of a
sentiment-annotated lexicon, developed as a component of an Urdu text based sentiment analysis
system. Urdu is a morphologically rich language and hence poses many challenges for the
development of such a lexicon. Moreover, because electronic text and corpora of opinionated
reviews are unavailable, the task becomes even more time consuming. After its development, the
lexicon is integrated with the sentiment classifier and the complete system is implemented. Two
corpora, of movie reviews and product reviews, are used for testing. Despite the inherent
complexities of the language, the experimentation gives encouraging results, with an accuracy
of about 74%. It is therefore planned to extend this lexicon, keeping the same structure but
with a larger coverage of words.

REFERENCES 

Andreevskaia, A. and S. Bergler: Mining WordNet for fuzzy sentiment: Sentiment tag extraction
from WordNet glosses. In: EACL 2006, Trento, Italy, (2006).

Annett, M. and G. Kondrak: A comparison of sentiment analysis techniques: Polarizing movie
blogs. In: Bergler, S. (ed.) Canadian AI 2008. LNCS (LNAI), vol. 5032, pp. 25–35. Springer,
Heidelberg, (2008).

Glaser, J., J. Dixit and P. D. Green: Studying hate crime with the Internet: What makes racists
advocate racial violence? Journal of Social Issues 58(1), 177–193, (2002).

Hatzivassiloglou, V. and J. Wiebe: Effects of Adjective Orientation and Gradability on Sentence
Subjectivity. In: 18th International Conference on Computational Linguistics, New Brunswick,
NJ, (2000).

Higashinaka, R., M. Walker and R. Prasad: Learning to generate naturalistic utterances using
reviews in spoken dialogue systems. ACM Transactions on Speech and Language Processing (TSLP),
(2007).

Hu, M. and B. Liu: Mining and summarizing customer reviews. In: Conference on Human Language
Technology and Empirical Methods in Natural Language Processing, (2005).

Humayoun, M., H. Hammarström and A. Ranta: Urdu morphology, orthography and lexicon extraction.
In: A. Farghaly and K. Megerdoomian (Eds.), Proceedings of the 2nd Workshop on Computational
Approaches to Arabic Script-based Languages, pp. 59–66. Stanford LSA, (2007).

Ijaz, M. and S. Hussain: Corpus based Urdu Lexicon Development. In: Conference on Language
Technology (CLT 2007), University of Peshawar, Pakistan, (2007).

Muaz, A., A. Ali and S. Hussain: Analysis and Development of Urdu POS Tagged Corpora. In:
Proceedings of the 7th Workshop on Asian Language Resources, IJCNLP, (2009).

Mukund, S., D. Ghosh and R. K. Srihari: Using Cross-Lingual Projections to Generate Semantic
Role Labeled Corpus for Urdu - A Resource Poor Language. In: 23rd International Conference on
Computational Linguistics COLING, (2010).

Pang, B. and L. Lee: Opinion mining and sentiment analysis. Foundations and Trends in
Information Retrieval 2(1-2), 1–135, (2008).

Riloff, E., J. Wiebe and T. Wilson: Learning subjective nouns using extraction pattern
bootstrapping. In: Proceedings of the Conference on Natural Language Learning (CoNLL), pp.
25–32, (2003).

Turney, P.: Thumbs up or thumbs down? Semantic orientation applied to unsupervised
classification of reviews. In: Proceedings of the Association for Computational Linguistics
(ACL), pp. 417–424, (2002).

Yu, H. and V. Hatzivassiloglou: Towards answering opinion questions: Separating facts from
opinions and identifying the polarity of opinion sentences. In: Proceedings of the Conference
on Empirical Methods in Natural Language Processing (EMNLP), (2003).
