sentiment-annotated lexicon construction for an urdu ... - Paas.com.pk

Pakistan Journal of Science (Vol. 63 No. 4 Dec, 2011)

parsing based chunking. It uses the lexicon to compare all the words and phrases present in the text. As a result, all the subjective terms in the given text are annotated. On the basis of the polarities of the individual words, the polarity of each sentence and then of the whole review is calculated. The overall system performance is evaluated using a corpus of movie reviews in the Urdu language. The classification algorithm is applied to the review corpus, and each subjective word in a review is compared with the lexicon entries to compute the polarity scores.

MATERIAL AND METHODS

This section describes the construction, structure and integration of the sentiment-annotated lexicon of Urdu words developed for a sentiment classification model. The model is designed to distinguish between the objective and subjective terms in a given review. Objective terms carry neutral sentiment and have no effect on the final classification decision, whereas subjective terms are the carriers of sentiment and their presence can alter the final classification. In keeping with this distinction, the lexicon entries are also categorized as objective or subjective terms. Before going into details, some terms are defined below:

• Orientation. Orientation describes either the positivity or the negativity of a lexicon entry. For most entries, orientation is predefined during the lexicon construction phase. In a given text, however, it can be altered by a polarity shifter in the sentence, e.g., the word “اچھا” (acha, good) has positive orientation, but with the polarity shifter “نہیں” (naheen, not) it becomes a negative expression, i.e., “اچھا نہیں” (acha naheen, not good). Moreover, the orientation of some words (though few in number) is highly domain specific or depends on the context in which they are used. These two issues are beyond the scope of this research.
• Intensity. This is the intensity of the orientation of a lexicon entry. It describes the force of the positivity or negativity of a term. Usually modifiers, e.g., “بہت” (bohat, very), describe the intensity of an expression. As in other languages, there are three degrees of intensity in Urdu: absolute (only positive or negative orientation), comparative (two distinct entities are compared with each other) and superlative (one of all the entities has the highest orientation).
• Polarity. A polarity mark is annotated with each lexicon entry to show its orientation and intensity. This is done at the implementation level.

Lexicon Construction: A sentiment-annotated lexicon is more intricate than other Natural Language Processing (NLP) lexicons. There are two reasons for this intricacy:

• Each lexicon entry carries polarity information in addition to its orthographic, phonological, syntactic and morphological features. This polarity information is usually represented as positive, negative or neutral. For example, SentiWordNet (Andreevskaia and Bergler, 2006) uses triplets [positive, negative, objective], with each value between a minimum of 0.0 and a maximum of 1.0.
• Most words exhibit multiple orientations depending on their use and domain. For example, in the sentence “This damage is everlasting”, the word everlasting is positive, but the overall orientation of the comment is negative. Similarly, unpredictable is a positive word when used about a movie's plot, but becomes negative when used about the performance of a microwave oven.

Figure 1. Structure of the sentiment-annotated lexicon with respect to O and I

Construction Steps: The lexicon construction task is divided into the following steps:

• Categorize the words as either subjective or objective. When the classification algorithm is applied to these words, the classifier simply ignores the objective terms, so its performance depends entirely on the subjective words.
• Categorize these words according to morphological rules, which work at the word level. These rules can change the structure, meaning and part of speech of a word. An example is the rules for marking an adjective to agree with the noun it qualifies.
• Identify their grammatical rules, which describe the possible structures of a sentence and the positions of the parts of speech with respect to each other. As Urdu is a free word order language, these rules are more difficult to define and implement. Examples are the use of modifiers with adjectives and the use of auxiliaries with verbs.
• Discover relationships between different lexicon entries. These relationships can define synonyms, antonyms, cross references, etc.
• Decide and annotate the polarities and then the intensities of the entries. In this task, the entries are first categorized as positive or negative, and then their intensity scores are attached to them. Some entries have only orientations, some have only intensities (like modifiers) and some have both values.

Lexicon Structure: It is assumed that the lexicon entries are either subjective or objective. The objective terms are saved without any polarity mark, but the subjective terms are further categorized on the basis of orientation and intensity into three types:

• Terms with orientation only, T(O). These are the terms which are either absolutely positive or absolutely negative. No degree of positivity or negativity is attached to them.
• Terms with intensity only, T(I). These are the terms which have no orientation of their own but can intensify the orientation of other words in the sentence.
• Terms with both orientation and intensity, T(O, I). If a term has both an orientation (either positive or negative) and an intensity, it lies in this category and is marked with both values.

Some examples of lexicon entries from all three categories, i.e., T(O), T(I) and T(O, I), are given in Table 1. For example, the word “کامیاب” (kamyaab, successful) has positive orientation but no intensity. Similarly, “زیادہ” (zyada, more) and “بہت” (bohat, very) both have intensity and no orientation, whereas “بہتر” (behtar, better) and “بہترین” (behtareen, best) both have positive orientation, with intensities of comparative and superlative degree, respectively.

Figure 2. Integration of sentiment annotated lexicon of Urdu words with the sentiment classifier

System Integration: The annotated lexicon of Urdu words is integrated with the sentiment classifier as shown in Figure 2. First of all, the given text in the form of a review is taken from the website.
The sentiment classifier component of the system preprocesses this review and segments it into sentences and then into words. These words are then tagged with their respective parts of speech. The tagged words are compared with the lexicon entries for sentiment orientations and intensities. This comparison results in polarity-annotated words and phrases. The classifier then calculates the sentiment orientation of the sentences using the term polarities.

RESULTS AND DISCUSSION

As already mentioned, corpora of reviews in Urdu text are not available in electronic form. Some other corpora, related to news and blogs, are accessible, but these are not appropriate for the experimentation and evaluation of our system because they do not contain opinionated text like reviews. Therefore, two corpora were manually collected as test-beds from the domains of movies and electronic appliances. The reviews were taken from different people to avoid monotonous opinions. The movie review corpus, MR, comprises 450 reviews in total, 226 positive and 224 negative. The product review corpus, PR, contains 328 reviews of electronic appliances, 177 positive and 151 negative.

Accuracy is used as the system performance metric. It measures how close the document classification suggested by our system is to the actual sentiment present in the review. A series of experiments is performed on both corpora, one after another. Table 2 shows the results, with accuracy of 66-74% for MR and 77-79% for PR. It also gives the variation in the classification of positive and negative reviews separately.
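The lexicon categories T(O), T(I) and T(O, I), the lexicon lookup step and the accuracy metric described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The numeric intensity values, the rule that a modifier scales the next subjective term, the whitespace tokenization, the tie-breaking rule and the extra negative entry “برا” (bura, bad) are all assumptions made only for illustration. Part-of-speech tagging and polarity shifters such as “نہیں” are left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Words from the paper's examples, plus BURA, an illustrative negative
# entry that does not appear in the paper.
ACHA = "اچھا"            # acha, good
KAMYAAB = "کامیاب"       # kamyaab, successful
BOHAT = "بہت"            # bohat, very
ZYADA = "زیادہ"          # zyada, more
BEHTAR = "بہتر"          # behtar, better
BEHTAREEN = "بہترین"     # behtareen, best
BURA = "برا"             # bura, bad (hypothetical entry)


@dataclass(frozen=True)
class LexiconEntry:
    orientation: Optional[int] = None   # +1 or -1; None for T(I) terms
    intensity: Optional[float] = None   # None for T(O) terms


# All intensity scores below are hypothetical, chosen only for the example.
LEXICON = {
    ACHA: LexiconEntry(orientation=+1),                      # T(O)
    KAMYAAB: LexiconEntry(orientation=+1),                   # T(O)
    BURA: LexiconEntry(orientation=-1),                      # T(O)
    BOHAT: LexiconEntry(intensity=2.0),                      # T(I)
    ZYADA: LexiconEntry(intensity=2.0),                      # T(I)
    BEHTAR: LexiconEntry(orientation=+1, intensity=2.0),     # T(O, I)
    BEHTAREEN: LexiconEntry(orientation=+1, intensity=3.0),  # T(O, I)
}


def sentence_polarity(tokens: Sequence[str]) -> float:
    """Sum the polarities of the subjective terms in one sentence."""
    score = 0.0
    boost = 1.0  # pending multiplier from preceding T(I) modifiers
    for token in tokens:
        entry = LEXICON.get(token)
        if entry is None:
            continue  # objective term: ignored by the classifier
        if entry.orientation is None:
            boost *= entry.intensity  # T(I): scales the next subjective term
            continue
        score += entry.orientation * (entry.intensity or 1.0) * boost
        boost = 1.0
    return score


def classify_review(text: str) -> str:
    """Label a review from the sum of its sentence polarities."""
    # Naive segmentation: split on the Urdu full stop, then on whitespace.
    sentences = [s for s in text.split("۔") if s.strip()]
    total = sum(sentence_polarity(s.split()) for s in sentences)
    # Assumption: a total of zero is labelled negative.
    return "positive" if total > 0 else "negative"


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of reviews whose predicted label matches the actual label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With these assumed scores, the token pair [BOHAT, ACHA] ("very good") receives twice the polarity of ACHA alone, and words absent from the lexicon contribute nothing, which mirrors the way objective terms are ignored.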