
sentiment-annotated lexicon construction for an urdu ... - Paas.com.pk


Pakistan Journal of Science (Vol. 63 No. 4 Dec, 2011)

parsing based chunking. It uses the lexicon to compare all the words and phrases present in the text. As a result, all the subjective terms in the given text become annotated. On the basis of the polarities of individual words, the polarity of each sentence and then of the whole review is calculated. The overall system performance is evaluated using a corpus of movie reviews in the Urdu language. The classification algorithm is applied to the review corpus, and each subjective word in a review is compared with the lexicon entries to compute the polarity scores.
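
The word-to-sentence-to-review aggregation described above can be sketched as follows. This is a minimal illustration, not the system evaluated in the paper: the toy lexicon, its scores, the whitespace tokenization (used here in place of parsing based chunking) and the choice of a plain sum as the aggregation rule are all assumptions.

```python
# Hypothetical toy lexicon: transliterated Urdu term -> polarity score.
# The entries and the scores are invented for illustration only.
LEXICON = {
    "acha": 1.0,       # good
    "bura": -1.0,      # bad
    "behtareen": 2.0,  # excellent
    "bekaar": -1.5,    # useless
}


def sentence_polarity(sentence, lexicon=LEXICON):
    """Sum the polarity scores of the words found in the lexicon."""
    # Words missing from the lexicon are treated as objective and ignored.
    return sum(lexicon.get(word, 0.0) for word in sentence.split())


def review_polarity(review, lexicon=LEXICON):
    """Aggregate sentence polarities into one review score and a label."""
    # Assumption: sentences are separated by a full stop.
    sentences = [s.strip() for s in review.split(".") if s.strip()]
    total = sum(sentence_polarity(s, lexicon) for s in sentences)
    if total > 0:
        label = "positive"
    elif total < 0:
        label = "negative"
    else:
        label = "neutral"
    return total, label


score, label = review_polarity(
    "film acha hai. kahani behtareen hai. gaane bekaar hain."
)
print(score, label)
```

Summing is only one possible aggregation rule; averaging per sentence or a majority vote over sentences would fit the same description.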

MATERIAL AND METHODS

In this section, the construction, structure and integration of the sentiment-annotated lexicon of Urdu words developed for a sentiment classification model are described. The model is designed to distinguish between the objective and subjective terms in a given review. Objective terms carry neutral sentiment and have no effect on the final classification decision, whereas subjective terms are considered the carriers of sentiment, and their presence can alter the final classification. Keeping this distinction in view, the lexicon entries are also categorized as objective and subjective terms. Before going into details, some terms are defined below:

• Orientation. Orientation describes either the positivity or the negativity of a lexicon entry. For most entries, the orientation is predefined during the lexicon construction phase. In a given text, however, it can be altered by the use of a polarity shifter in the sentence; e.g., the word “اچھا” (acha, good) has a positive orientation but, with the polarity shifter “نہیں” (naheen, not), it becomes a negative expression, i.e., “اچھا نہیں” (acha naheen, not good). Moreover, the orientation of some words (though they are few in number) is highly domain specific or depends upon the context within which they are used. These two issues are beyond the scope of this research.

• Intensity. This is the intensity of the orientation of a lexicon entry, i.e., the force of positivity or negativity of a term. Usually, modifiers such as “بہت” (bohat, more) describe the intensity of an expression. As in other languages, there are three degrees of intensity in Urdu: absolute (only positive or negative orientation), comparative (two distinct entities are compared with each other) and superlative (one of all entities has the highest orientation).

• Polarity. A polarity mark is annotated with each lexicon entry to show its orientation and intensity. This is done at the implementation level.
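
The three definitions above can be made concrete with a small sketch. It is an illustration only: the class names, the numeric weights given to the three degrees of intensity, and the rule that a polarity mark is orientation multiplied by intensity are assumptions, not the paper's implementation. The `shift` helper merely shows the effect of a shifter such as "naheen"; the paper states that handling polarity shifters is beyond its scope.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    POSITIVE = 1
    NEGATIVE = -1


class Intensity(Enum):
    # Hypothetical numeric weights for the three degrees named in the text.
    ABSOLUTE = 1
    COMPARATIVE = 2
    SUPERLATIVE = 3


@dataclass(frozen=True)
class LexiconEntry:
    term: str
    orientation: Orientation
    intensity: Intensity = Intensity.ABSOLUTE

    @property
    def polarity(self) -> int:
        """Polarity mark: sign from orientation, magnitude from intensity."""
        return self.orientation.value * self.intensity.value


def shift(polarity: int) -> int:
    """Flip a polarity mark, as a polarity shifter such as 'naheen' would."""
    return -polarity


acha = LexiconEntry("acha", Orientation.POSITIVE)
print(acha.polarity)         # 1
print(shift(acha.polarity))  # -1
```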

Lexicon Construction: A sentiment-annotated lexicon is more intricate than other Natural Language Processing (NLP) lexicons. There are two reasons for this intricacy:

• Each lexicon entry carries polarity information in addition to its orthographic, phonological, syntactic and morphological features. This polarity information is usually represented as positive, negative or neutral. For example, SentiWordNet (Andreevskaia and Bergler, 2006) uses triplets [positive, negative, objective], with a minimum value of 0.0 and a maximum of 1.0.

• Most words exhibit multiple orientations depending upon their use and domain. For example, in the sentence “This damage is everlasting”, everlasting is a positive word, but the comment’s overall orientation is negative. Similarly, unpredictable is a positive word when used about a movie’s plot, but becomes negative when used about the performance of a microwave oven.
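
Both points can be illustrated together: a triplet-style polarity record whose values are constrained to the range 0.0 to 1.0, and a lookup keyed by domain so that the same word can carry different orientations. This is a hedged sketch; the class, the domain names and the scores are hypothetical and are not taken from SentiWordNet or from the paper's lexicon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolarityTriplet:
    positive: float
    negative: float
    objective: float

    def __post_init__(self):
        # Each component must lie between the minimum 0.0 and the maximum 1.0.
        for name in ("positive", "negative", "objective"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0.0, 1.0], got {value}")

    def label(self) -> str:
        """Name of the largest component; ties resolve in field order."""
        scores = {
            "positive": self.positive,
            "negative": self.negative,
            "objective": self.objective,
        }
        return max(scores, key=scores.get)


# Hypothetical domain-keyed entries; the scores are invented for illustration.
DOMAIN_LEXICON = {
    ("unpredictable", "movie"): PolarityTriplet(0.75, 0.0, 0.25),
    ("unpredictable", "appliance"): PolarityTriplet(0.0, 0.75, 0.25),
}


def orientation(word: str, domain: str) -> str:
    """Look up the dominant orientation of a word within a given domain."""
    return DOMAIN_LEXICON[(word, domain)].label()


print(orientation("unpredictable", "movie"))
print(orientation("unpredictable", "appliance"))
```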

Figure 1. Structure of the sentiment-annotated lexicon with respect to O and I

Construction Steps: The lexicon construction task is divided into the following steps:

• Categorize the words as either subjective or objective. When the classification algorithm is applied to these words, the classifier simply ignores objective terms; its performance therefore depends entirely upon the subjective words.

• Categorize these words according to morphological rules, which work at the word level. These rules can change the structure, meaning and part of speech of the words. For example, rules for marking an adjective with the noun it qualifies, etc.

• Identify the grammatical rules, which describe the possible structures of a sentence and the position of the parts of speech with respect to each other. As Urdu is a free word order language, these rules are more difficult to define and implement. For example, the use of modifiers with adjectives or the use of auxiliaries with verbs, etc.

• Discover relationships between different lexicon entries. These relationships can define synonyms, antonyms and cross references, etc.
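
One way to picture the outcome of these steps is a lexicon whose entries record subjectivity, part of speech and links to related entries. The sketch below is an assumption about how such a structure could look, not the structure shown in Figure 1; the `Entry` and `Lexicon` classes, their fields and the sample terms are hypothetical. It does not model the morphological or grammatical rules themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: str
    pos: str             # part of speech, e.g. "adj" or "noun"
    subjective: bool     # False for objective (neutral) terms
    polarity: int = 0    # stays 0 for objective terms
    synonyms: set = field(default_factory=set)
    antonyms: set = field(default_factory=set)


class Lexicon:
    def __init__(self):
        self.entries = {}

    def add(self, entry):
        self.entries[entry.term] = entry

    def relate_antonyms(self, a, b):
        """Record a symmetric antonym link between two existing entries."""
        self.entries[a].antonyms.add(b)
        self.entries[b].antonyms.add(a)

    def subjective_terms(self, words):
        """Keep only words with a subjective entry; the rest are ignored."""
        return [
            w for w in words
            if w in self.entries and self.entries[w].subjective
        ]


lex = Lexicon()
lex.add(Entry("acha", "adj", True, 1))    # good
lex.add(Entry("bura", "adj", True, -1))   # bad
lex.add(Entry("film", "noun", False))     # objective term
lex.relate_antonyms("acha", "bura")
print(lex.subjective_terms("film acha hai".split()))  # ['acha']
```

Storing relationships as sets of terms rather than object references keeps the entries easy to serialize; a cross reference could be added as a further set in the same way.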
