
Pakistan Journal of Science (Vol. 62 No. 3 September, 2010)

SOFTWARE QUALITY METRICS FOR RHETORICAL STRUCTURE THEORY BASED INFORMATION RETRIEVAL SYSTEMS

M. Shoaib, S. Arshad, S. Jabeen and M. P. Tariq *

Department of Computer Science & Engineering, Lahore, Pakistan, * Department of CS, Virtual University of Pakistan

Corresponding author: tariq_cp@hotmail.com

ABSTRACT: Several information retrieval models currently exist, for example the Boolean model, the vector space model and the probabilistic model, but none of them provides exactly the information the user requires. Search engines based on such models emphasize the syntactic aspects of the query rather than capturing its semantics. Most work in the domain of information retrieval is based on keyword based indexing. Searching with keyword based indexing leads to irrelevant result sets and poor search engine performance. Rhetorical structure theory has revolutionized information retrieval because it improves the performance of search engines by introducing semantic based retrieval, thereby reducing irrelevant results to a minimum. In this paper, we present and validate design metrics for rhetorical structure theory based information retrieval systems so as to measure retrieval relevance and to improve retrieval performance.

Keywords: Keyword based indexing, dynamic weight, text based information retrieval, tokens, discourse structure

INTRODUCTION

In Text Based Information Retrieval (TBIR) systems, when the user enters a query ‘q’ against a collection of documents ‘c’, each document ‘d’ is examined and given a weight according to how well it satisfies the semantics of the query ‘q’ or, in other words, how well it fulfils the requirement of the user. For every instance of the triple (q, d, c), the weight attributed to a document ‘d’ is obtained by evaluating a function. Generally speaking, an Information Retrieval (IR) system (Arnt et al., 2004 and Petratos, 2006) performs various retrieval tasks such as document indexing, ranking and classification. Document ranking is achieved by sorting all documents in the collection on the basis of their assigned weights (Papka et al., 2008). The core job of an IR system is the parsing process. The tokens produced by parsing documents and queries are called terms. A query provided to the IR system in natural language is parsed into a set of terms. The terms derived from a document are used to build an inverted list, which serves as an index to the document collection. Normally, IR systems assume that if a term occurs in both the query and a document, then the document is relevant to the query. Co-occurrences of multiple query terms in a document further contribute to the degree of relevance of that document. Current information retrieval techniques (Liu et al., 2009) are able to retrieve only 30% of the relevant information. Current information retrieval systems are based on keyword based indexing, in which keywords are assigned static weights using retrieval models such as the extended Boolean model, the vector model or the probabilistic model. Words that carry different meanings in different contexts can therefore cause irrelevant information to be retrieved (Shoaib and Shah, 2006).
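
The inverted list and the term co-occurrence assumption described above can be sketched as follows. This is only a minimal illustration, not the system discussed in the paper: the function names `parse_terms`, `build_inverted_list` and `cooccurrence_scores`, the tokenization rule and the two sample documents are assumptions introduced here.

```python
from collections import defaultdict


def parse_terms(text):
    """Split text into lower-case alphanumeric tokens (the 'terms')."""
    terms, current = [], []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


def build_inverted_list(documents):
    """Map every term to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in parse_terms(text):
            index[term].add(doc_id)
    return index


def cooccurrence_scores(query, index):
    """Count, per document, how many distinct query terms it shares."""
    scores = defaultdict(int)
    for term in set(parse_terms(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)


# Hypothetical sample documents, used only for illustration.
documents = {0: "The day was long.", 1: "Lyrics of the day song"}
index = build_inverted_list(documents)
scores = cooccurrence_scores("lyrics day", index)
```

A purely syntactic score of this kind cannot tell in which sense a shared term is used, which is the limitation the paper attributes to keyword based indexing.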

Rhetorical Structure Theory (RST) (Mann and Thompson, 1986, Mann and Thompson, 1987, Mathkour, et al., 2008, Taboada and Mann, 2006) is a descriptive linguistic approach that identifies textual relations with the intention of generating text. The success of RST can be seen in its wide application in domains ranging from discourse analysis to psycholinguistics, theoretical linguistics and computational linguistics (Jun'chi and Tsujii, 1994, Mathkour, et al., 2008, Taboada and Mann, 2005). The theory is widely used in applications of computational linguistics, for example text parsing and generation, machine translation and essay scoring, and most importantly in natural language processing (Mann and Thompson, 1987, Marcu, 1997). The theory organizes a text by means of the relations that hold between its sections and establishes coherence, in which every section of the text plays a role or function with respect to the other sections. The resulting relations are variously termed coherence relations, discourse relations, conjunction relations and rhetorical relations. There may be 30 different relations between the sections of a text (Taboada, and Mann, 2005, Taboada, and Mann, 2006). RST identifies two different types of text units: the nucleus and the satellites. Nuclei are of prime importance, whereas satellites contribute to the nucleus and are considered secondary. In an elaboration relation, the nucleus is the part of the text containing the basic information and the satellites are the parts containing additional or supporting information about the nucleus. RST relations are applied to the text recursively until all text units are associated with one of the RST relations (Taboada, and Mann, 2006).

With the enormous increase in the volume of web content, it has become essential to improve the performance of search engines so that they can retrieve semantically relevant information in response to the user query. Currently, most search engines are based on keyword based searching techniques, which are considered static because they rely on syntax only. These IR systems are unable to handle the various contexts in which the same term appears in different documents. RST based IR systems cope with this problem by capturing the semantics of terms using dynamic weight assignment techniques.

In this paper, we present various design metrics for RST based IR systems to measure retrieval relevancy and improve the performance of IR systems. We also validate the proposed metrics through various case studies. Our research builds on the architecture of the RST based IR system presented by Shoaib and Shah (2006). The case studies used to validate the proposed design metrics are also taken from Shoaib and Shah (2006).

The rest of the paper is organized as follows: In section 2, we describe the architecture of the RST based IR system upon which we base the design metrics and their validation. In section 3, we present materials and methods. In section 4, we present results and discussions, and in section 5 we conclude our work and present future directions.

Architecture of RST based IR system: Since we have selected the architecture of the RST based IR system proposed by Shoaib and Shah (2006) as the basis of our research, we describe this architecture in detail in this section. They proposed an indexing technique with dynamic weight assignment based on RST. They used punctuation marks and cue phrases (Marcu, 2000) (words that connect two or more text spans) to define the rhetorical relations. They constructed an RST tree whose leaves represent the text spans and whose internal nodes represent the rhetorical relations. Keywords extracted from the text spans are assigned dynamic weights such that keywords in text spans closer to the nucleus are given priority weights. With this dynamic weight assignment technique the precision rate is clearly improved. A system developed on the basis of this approach is able to capture the semantics of a document efficiently. The architecture of their proposed model (Shoaib and Shah, 2006) consists of four modules: 1) Segmentor 2) Rhetorical Relation Finder 3) Rhetorical Tree Parser and 4) RST based Indexer (fig. 1).

Fig-1: Architecture of RST based IR systems proposed by Shoaib and Shah (2006)

The purpose and functionality of each component are described as follows.

Segmentor: Text is usually organized into words, sentences and paragraphs. The correlation between these different text spans needs to be identified. To handle the structure in the text, the boundaries in that structure must be identified. This is the primary function of text segmentation. The text segmentor provides the structural information that is used for text analysis.

Rhetorical relation finder: In the next step the indexer automatically finds the rhetorical relations. For this purpose, cue phrases and punctuation are used to determine the relationship between text spans.

RST tree: From the list of text spans and the list of relations generated in the previous stages, a rhetorical tree is generated whose internal nodes represent the relations and whose leaves represent the text spans. Based on the height of the tree, an initial weight is assigned to the text spans. The indexer uses the keywords and their corresponding initial/dynamic weights to populate its knowledge base.

Indexer: The indexer uses the concept of strong and weak nodes to assess the initial weight, which is then used for dynamic weight assignment. The initial weight lies between 0 and 1. The root node is assigned 1, whereas the nucleus and satellite of a parent node are assigned 0.9 and 0.5 respectively. These weights are variable. The weight assigned to a child node is generalized by the following formula:

Initial weight of child node = Weight of child node * Weight of parent node.

After the initial weights of all child nodes have been calculated, they are associated with the index terms occurring in the text spans. On the basis of the initial weight assessment and the term frequency, the actual weight of an index term is calculated as follows:

Actual weight of the index term = Initial weight assessment * Term frequency.

The indexer maintains the knowledge base by saving the document ID, vocabulary ID and actual weights. The document ID keeps track of which word belongs to which document. The vocabulary ID ensures that no word is stored redundantly in the knowledge base, thereby consuming less space. The dynamic weight indicates the semantic based occurrence of important index terms.
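
The two weight formulas above can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the implementation of Shoaib and Shah (2006): the `Node` class, the function names and the small example tree are hypothetical, and summing a term's weight over several text spans is a choice made here, since the text does not say how repeated occurrences across spans are combined. Only the constants 1, 0.9 and 0.5 and the two multiplication rules come from the description.

```python
from dataclasses import dataclass, field

NUCLEUS_WEIGHT = 0.9    # weight of a nucleus relative to its parent
SATELLITE_WEIGHT = 0.5  # weight of a satellite relative to its parent


@dataclass
class Node:
    role: str                                     # "root", "nucleus" or "satellite"
    terms: dict = field(default_factory=dict)     # term -> frequency in this span
    children: list = field(default_factory=list)
    initial_weight: float = 0.0


def assign_initial_weights(node, parent_weight=1.0):
    """Initial weight of child = weight of child * weight of parent."""
    if node.role == "root":
        node.initial_weight = 1.0
    else:
        own = NUCLEUS_WEIGHT if node.role == "nucleus" else SATELLITE_WEIGHT
        node.initial_weight = own * parent_weight
    for child in node.children:
        assign_initial_weights(child, node.initial_weight)


def actual_weights(node, out=None):
    """Actual weight of a term = initial weight * term frequency.

    Contributions from different spans are summed (an assumption).
    """
    if out is None:
        out = {}
    for term, tf in node.terms.items():
        out[term] = out.get(term, 0.0) + node.initial_weight * tf
    for child in node.children:
        actual_weights(child, out)
    return out


# Hypothetical tree: a nucleus, and a satellite that has its own nucleus.
tree = Node("root", children=[
    Node("nucleus", {"day": 2}),
    Node("satellite", {"lyrics": 1}, [Node("nucleus", {"day": 1})]),
])
assign_initial_weights(tree)
weights = actual_weights(tree)
```

In this example the nested nucleus receives 0.9 * 0.5 = 0.45, so a term deep inside a satellite contributes less than the same term in the top-level nucleus.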

MATERIALS AND METHODS

In this section, we propose a number of metrics related to the following basic components of an RST based IR system.

1. Segmentor
2. Rhetorical relation finder
3. Rhetorical tree parser
4. RST based indexer

Number of segments metric: Number of Segments (NoS) is one of the proposed metrics for improving the performance of IR systems. It applies to the segmentor, one of the basic components of RST based IR systems. The metric calculates the total number of segments into which a text document can be divided. It is defined as the sum of the number of cue phrases and the number of punctuation marks identified (since these two parameters are used to divide a text into segments). A segment is defined as any elementary unit such as a keyword, a line of text or a paragraph itself.

No way has yet been defined to validate the correct number of segments in advance. The output of the first component of the architecture is the list of individual segments. This list is fed into the next component, the rhetorical relation finder, which works on the list of segments to identify the corresponding rhetorical relations. Computing the number of segments in advance gives a clear idea of how much memory will be required to store the segments. The number of segments also helps to judge the structure of the RST tree, and hence the corresponding rhetorical relations, in advance.

Mathematically, the Number of Segments metric is defined as follows:

$$NoS = \sum_{d \in D} \left( NoC_d + NoP_d \right)$$

where, for each document $d$ of the document collection $D$,

NoC = Number of Cue Phrases identified
NoP = Number of Punctuation Marks identified

Rhetorical relation distance metric: There are two ways to measure the distance between two text spans. The first is to calculate the physical, linear distance between them. This syntactic linear distance is the number of consecutive sentences that occur between the two text spans. The other way is to compute the semantic distance between the text spans. This distance can be determined by computing the rhetorical relation distance (RRD) metric, which measures the semantic distance between two text spans as the number of internal nodes between them in a discourse graph. The nodes of the graph represent the text spans. Computing the rhetorical distance helps in analyzing the strength of a rhetorical relation: the greater the value of the RRD metric, the weaker the rhetorical relation, and vice versa. Hence,

RRD = 0 means strongest relation (Direct relation)
RRD = 1 means good relation (Indirect relation)
RRD > 1 means weak relation (Indirect relation)

A weaker rhetorical relation indicates an indirect relation between two nodes, where each node represents a text span. A rhetorical relation is considered stronger if it is a direct relation between two nodes. Rhetorical relation strength can help in the construction and improvement of the rhetorical structure tree.

Suppose there are two nodes $N_i$ and $N_j$ in the RST graph, and the level of a node is represented as $L$, with the indexes $i$ and $j$ giving the specific level number. We assume that $L_j$ always indicates a deeper level than $L_i$. The index $k$ represents the particular node number at a specific level. Mathematically, the rhetorical relation distance between these two nodes is then

$$RRD = L_j(N_{j,k}) - L_{i+1}(N_{i,k})$$

For nodes at the same level whose parents are non-sibling nodes, at various depths, the RRD is computed as:

$$L_x = x + (x - 1)$$

where $L_x > 2$ and $x$ represents the number of the level.

$L_1 = 1 + 0 = 1$   case 1 [common parent]
$L_2 = 2 + 1 = 3$   case 2 [sibling parent]
$L_3 = 3 + 2 = 5$   case 3 [non-sibling parent]
$L_4 = 4 + 3 = 7$   as above
And so on…
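
The RRD rules above reduce to simple integer arithmetic on level numbers. The following sketch shows one way to express them, assuming that the levels of the two nodes are already known; the function names `rrd`, `same_level_rrd` and `relation_strength` are hypothetical and are not taken from the paper.

```python
def rrd(level_i, level_j):
    """General case: RRD = L_j - L_(i+1), where level_j is the deeper level."""
    if level_j <= level_i:
        raise ValueError("level_j must be deeper than level_i")
    return level_j - (level_i + 1)


def same_level_rrd(x):
    """Same-level case: L_x = x + (x - 1) for two nodes at level x."""
    if x < 1:
        raise ValueError("level number must be at least 1")
    return x + (x - 1)


def relation_strength(distance):
    """Map an RRD value to the strength classes listed in the text."""
    if distance < 0:
        raise ValueError("distance must not be negative")
    if distance == 0:
        return "strongest (direct)"
    if distance == 1:
        return "good (indirect)"
    return "weak (indirect)"


# Levels 0 and 2 give a distance of 1; levels 0 and 1 give a distance of 0.
print(rrd(0, 2), relation_strength(rrd(0, 2)))
print(rrd(0, 1), relation_strength(rrd(0, 1)))
print([same_level_rrd(x) for x in (1, 2, 3, 4)])
```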

Metrics for indexer: This section defines some important measurements for the indexer component of the architecture.

Indexer size metric: There are various measures of the size of an indexer. Some researchers measure it in terms of the number of web pages crawled, while others emphasize how much computer storage is required to support the index (Brown, 1995, Henzinger, et al., 1999). Our proposed indexer size metric adopts a different approach. We measure the size of the indexer in terms of the number of key terms taken from the documents of a document corpus and stored in the indexer as index terms.

So the indexer size metric (S_idx) is defined as, “The sum of index terms taken from individual documents of the document collection”.

Mathematically,

$$S\_idx = \sum_{d \in D} \sum_{j} d(T\_idx_j)$$

where ‘d’ represents a particular document in the document corpus ‘D’, T_idx j represents an index term taken from a particular document, and j is the counter for the total number of index terms taken from that document.

Indexer term relevancy metric: The efficacy of the indexer mainly depends upon the degree of relevancy of the index terms. We define the index term relevancy (ITR) metric as

“The degree of relevancy of the index term according to the various contexts in which the term is used in different documents”

The indexer ensures semantically relevant results by assigning initial weights to the nodes of the RST tree. It then computes the dynamic weights of the index terms based on the initial weights and the term frequency. Mathematically,

$$\text{Dynamic weight of index term } i = W\_init_i \times TF_i$$

where TF i indicates the term frequency of an index term and W_init i represents the initial weight of an index term.

User query relevancy metric: The User Query Relevancy (UQR) metric can be defined as

“The extent to which the results returned by the indexer are relevant to the user’s query”.

The UQR can be computed in terms of the major operation performed in the indexer searching process, namely finding the index terms with the maximum dynamic weight, so we compute this operation using the max() function. When the user enters a query in natural language form, the max() function is evaluated for each term in the query to find the most relevant index terms (the terms having maximum dynamic weights). Mathematically

Quality of indexer metric: Another important metric for measuring the effectiveness of the indexer is the Quality of Indexer (QI) metric. When a user enters a query for which there can be multiple answers to choose from, the most important issue for indexer effectiveness is whether it indexes the information that is most useful from the user's perspective. Another reason for considering the quality of the indexer is that the size of the indexer grows at a slower rate than the amount of web content. Because of the storage and processing power limitations of the indexer, it becomes important for the indexer to index the most useful and relevant documents. In addition, search engines should index the useful documents in ranked order so that the user gets the most relevant information rather than too many results, most of which are irrelevant to the query. Hence, users increasingly need indexers with high quality indexes. Q(I) can, in a broader perspective, be considered at two tiers:

a. The overall quality of the documents indexed
b. The average quality per indexed document

Suppose every document indexed by the search engine is given a weight w(d). For convenience, we assume the weights are scaled so that the sum of all weights is 1. If I denotes the indexer and D the set of all documents indexed by the search engine, then we define the quality of the indexer Q(I) as

$$Q(I) = \sum_{d \in D} w(d)$$

Note that since w is scaled, we always have 0 ≤ Q(I) ≤ 1.

If │D│ denotes the number of documents indexed by I, then the average quality per indexed document is:

$$Q_{avg}(I) = Q(I) / │D│$$

This average quality per indexed document is a very good measure of how well an indexer selects documents for indexing.

RESULTS AND DISCUSSIONS

In this section we validate the metrics defined for the architecture of RST based IR systems.

Validation of NoS metric: We validate the Number of Segments (NoS) metric for the following documents.

Table- 1. Documents taken as a sample text used by Shoaib and Shah (2006).

We have taken two documents (Doc 0 and Doc 1), as shown in table 1.

For Document 0: NoC = 1 NoP = 3
For Document 1: NoC = 8 NoP = 4

The total number of segments is then computed as follows:

NoS = [([1 + 3] + [8 + 4])]
= [(4 + 12)]
= 16

Shoaib and Shah (2006) report the same number of segments, which agrees with our NoS metric.
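
The NoS computation above can be reproduced with a short sketch. The per-document counts (1, 3) and (8, 4) are the values reported in the text. The cue phrase list, the punctuation set, the sample sentence and the function names are assumptions made only for illustration; the paper does not list the cue phrases or punctuation marks it uses.

```python
import re

# Hypothetical cue phrases and punctuation marks, for illustration only.
CUE_PHRASES = ("because", "although", "however", "for example")
PUNCTUATION = ".,;:!?"


def count_cue_phrases(text):
    """Count whole-word occurrences of the cue phrases (NoC)."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(cue) + r"\b", lowered))
        for cue in CUE_PHRASES
    )


def count_punctuation(text):
    """Count the punctuation marks in the text (NoP)."""
    return sum(text.count(mark) for mark in PUNCTUATION)


def number_of_segments(counts):
    """NoS = sum over documents of (NoC + NoP)."""
    return sum(noc + nop for noc, nop in counts)


# Counts reported in the validation above: (NoC, NoP) for Doc 0 and Doc 1.
paper_counts = [(1, 3), (8, 4)]
nos_paper = number_of_segments(paper_counts)

# The same metric computed from raw text, using the hypothetical lists.
sample = "Although it rained, we left. However, the road was closed because of floods."
nos_sample = number_of_segments([(count_cue_phrases(sample), count_punctuation(sample))])
```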

Validation of RRD metric: Taking an example as a case study from the work presented by Bosma (2005), the discourse structure shown in figure 2 is converted to the RST graph in figure 3.

Fig- 2: Discourse Structure taken from (Bosma, 2005).

Fig- 3: RST Graph 1

The nodes of the RST graph in figure 3 represent the text spans in terms of nuclei and satellites. The arrows represent the relation between the nucleus and the satellite, pointing from the nucleus to the satellite. From this RST graph the RRD between nodes 3C and 3A can be calculated as

RRD between 3C and 3A $= L_j(N_{j,k}) - L_{i+1}(N_{i,k})$
$= 2_0 - (0+1)_0$
$= 1$

Hence, the number of internal nodes between 3C and 3A is 1, which shows that the rhetorical relation between 3C and 3A is good.

Whereas,

RRD between 3C and 3D $= L_j(N_{j,k}) - L_{i+1}(N_{i,k})$
$= 1_0 - (0+1)_0$
$= 0$

Hence, the rhetorical relation between 3C and 3D is direct, or strong.

Exceptional Cases

Case No. 1

If two nodes at the same level have a common parent node, then the RRD will always be equal to 1. So, for nodes 3D and 3B in figure 4, RRD between 3B and 3D = 1.

Fig-4: RST Graph 2

Case No. 2

The second exceptional case states that the RRD between two nodes, such as 3F and 3A (figure 4), whose parents are sibling nodes will always be equal to 3. Hence RRD between 3F and 3A = 3.

Case No. 3

The third exceptional case states that the RRD between two nodes 3H and 3E (figure 5) that exist at level 3 and have non-sibling nodes as parents will be

RRD between 3H and 3E:
$L_x = x + (x-1)$
$L_3 = 3 + 2$
$= 5$

Fig- 5: RST Graph 3

Hence, the number of internal nodes between 3H and 3E is 5, which indicates that the relation between nodes 3H and 3E is indirect and weak.

Table-2. Indexed terms in the knowledge base used by Shoaib and Shah (2006)

Fig- 6: Top ten results returned by the Google search engine

Validation of S_idx metric: We suppose that the document corpus consists of the two documents shown in table 1. The size of the indexer is calculated as follows:

We have │D│ = 2
For Doc 0:
d0(T_idx) = 19
For Doc 1:
d1(T_idx) = 21
S_idx = 0 + [ 19 + 21 ]
S_idx = 0 + 40
S_idx = 40

Hence, the size of the indexer is 40 index terms, which agrees with the indexer proposed by Shoaib and Shah (2006).
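
As a small sketch of this calculation, the indexer size can be obtained by summing the number of index terms stored per document. The counts 19 and 21 are taken from the validation above. The function name `indexer_size` and the toy term sets are hypothetical, since the actual index terms of table 1 are not reproduced here.

```python
def indexer_size(index_terms_per_document):
    """S_idx: total number of index terms over all documents in the corpus."""
    size = 0  # the running total starts at 0, as in the worked example
    for terms in index_terms_per_document.values():
        size += len(terms)
    return size


# Toy corpus with hypothetical index terms, for illustration only.
toy_corpus = {0: {"day", "sun"}, 1: {"day", "lyrics", "song"}}
toy_size = indexer_size(toy_corpus)

# Using only the per-document counts reported in the text (19 and 21).
reported_counts = {0: 19, 1: 21}
reported_size = 0 + sum(reported_counts.values())
```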

Validation of ITR metric: To validate the ITR metric we take the case study shown in table 2. Dynamic weights vary according to the context in which the index term is used in different documents. For example, suppose the user submits a query to search for information about “lyrics day”. In the indexer, as shown in table 2, the index term “day” is assigned the dynamic weight 5.39 in document 0 but 2.24 in document 1, according to the context in which it is used. The higher the value of the dynamic weight, the greater the relevancy. Therefore, document 0 is returned in response to the user query.
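
The selection step in this case study, picking for a query term the document with the maximum dynamic weight, might look like the sketch below. Only the two weights for “day” (5.39 and 2.24) come from the text. The dictionary layout, the function name `most_relevant` and the decision to skip query terms that are absent from the knowledge base are assumptions made for illustration.

```python
def most_relevant(query, knowledge_base):
    """For each query term, return (document ID, weight) with the maximum dynamic weight."""
    results = {}
    for term in query.lower().split():
        postings = knowledge_base.get(term)
        if postings:  # terms missing from the knowledge base are skipped
            doc_id = max(postings, key=postings.get)
            results[term] = (doc_id, postings[doc_id])
    return results


# term -> {document ID: dynamic weight}; only the "day" entry is from the text.
knowledge_base = {"day": {0: 5.39, 1: 2.24}}
best = most_relevant("lyrics day", knowledge_base)
```

Here `best` maps “day” to document 0, matching the outcome described above.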

Validation of QI Metric: The indexer quality metric is validated by the following case study. Suppose the user queries the Google search engine for the keyword “dooms day” with the intention of getting information about the Day of Judgment. The result returned by the Google search engine in response to the user query is shown in figure 6.

The search engine retrieved 5520000 results for the user query. For the case study we take only the top ten results, in ranking order. After analyzing the contents of the documents, the following scaled weights are assigned to the top ten documents:

d1= 0.10 d2= 0.075 d3= 0.050
d4= 0.20 d5= 0.025 d6= 0.075
d7= 0.30 d8= 0.025 d9= 0.050
d10= 0.075

The quality of the indexer is then computed as:

Q(I) = (0.10 + 0.075 + 0.050 + 0.20 + 0.025 + 0.075 + 0.30 + 0.025 + 0.050 + 0.075)
Q(I) = 0.975

which satisfies 0 ≤ Q(I) ≤ 1.

A value of Q(I) closer to 1 means that the indexer has indexed valuable and relevant documents.

Proceeding further, we get the average quality per indexed document as

Q avg (I) = Q(I) / │D│
= 0.975 / 10 = 0.0975
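
The two tiers of the quality metric can be checked with a few lines of code. The ten weights are those listed in the case study; the function names `quality` and `average_quality` are hypothetical. Summing the weights exactly as listed gives 0.975, and dividing by the ten documents gives 0.0975.

```python
import math


def quality(weights):
    """Q(I): the sum of the weights of all indexed documents."""
    return math.fsum(weights)


def average_quality(weights):
    """Q_avg(I) = Q(I) / |D|, where |D| is the number of indexed documents."""
    if not weights:
        raise ValueError("at least one indexed document is required")
    return quality(weights) / len(weights)


# Weights d1 to d10 as listed in the case study.
weights = [0.10, 0.075, 0.050, 0.20, 0.025, 0.075, 0.30, 0.025, 0.050, 0.075]
q = quality(weights)
q_avg = average_quality(weights)
```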


Conclusion and Future Work: We have proposed a number of metrics for RST based IR systems. The Number of Segments (NoS) metric shows that the number of segments can be determined in advance, leading to better decisions on storage capacity and a better understanding of the rhetorical relations and the RST tree construction. The next metric we define relates to the computation of the semantic distance between any two nodes of the RST graph. We then propose a number of metrics related to the most important component of the RST based IR architecture, i.e. the indexer. We propose the size of indexer metric, based on the number of key terms indexed, and validate it to show that the size of indexer metric is directly related to the efficiency of the indexer. The Indexer Term Relevancy metric is based on the value of the maximum dynamic weight. The higher the degree of term relevancy, the better the efficiency of the indexer. Extending the same concept, we propose and validate the user query relevancy metric, which computes the efficacy of the indexer in returning the most relevant results in response to the user query. Finally, we propose the quality of indexer metric, based on the weights assigned to the indexed documents.

A challenging issue for extending this research is to automate the computation of the metrics, so that when the user enters a query the metrics are computed automatically to show the performance of the search engine. Another issue that can be addressed in future is a comparative study of RST based IR systems against other information retrieval systems using the metrics defined. An automated model can be designed for this comparative analysis. The model would have a search-engine-like interface that accepts the user query and returns comparative performance statistics for the RST based IR systems.

REFERENCES

Arnt, A., S. Zilberstein, J. Allan and A. Mouaddib. Dynamic Composition of Information Retrieval Techniques, Journal of Intelligent Information Systems, 23:67-97 (2004).

Bosma, W. Query Based Summarization using Rhetorical Structure Theory, 15th Meeting of CLIN, 29-44 (2005).

Brown, E. W. Execution Performance Issues in Full-Text Information Retrieval, Technical Report, 95-81 (1995).

Henzinger, M. R., A. Heydon, M. Mitzenmacher and M. Najork. Measuring Index Quality Using Random Walks on the Web, Comput. Netw, 31:1291-1303 (1999).

Jun'chi, F. and J. Tsujii. Breaking down rhetorical relations for the purpose of analyzing discourse structures, 15th International Conference on Computational Linguistics, Kyoto, 1177-1183 (1994).

Liu, P., Z. Zhu and L. Zhao. Research on Information Retrieval System Based on Ant Clustering Algorithm, Journal of Software, 4:1032-1036 (2009).

Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: description and construction of text structures, Information Sciences Institute, Nijmegen, The Netherlands, ISI/RS-86-174, 1-15 (1986).

Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: A Framework for the Analysis of Texts, IPRA Papers in Pragmatics, 1:79-105 (1987).

Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: description and construction of text structures, In Natural Language Generation: Recent Advances in Artificial Intelligence, Psychology, and Linguistics (G. Kempen, Hrsg.), Boston/Dordrecht: Kluwer Academic Publishers, 85-96 (1987).

Marcu, D. The Rhetorical Parsing of Natural Language Texts, The Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL'97/EACL'97), Madrid, Spain, 96-103 (1997).

Marcu, D. The Theory and Practice of Discourse Parsing and Summarization, 1st Edn., The MIT Press, Cambridge, MA., ISBN-10: 0262133725, 268 (2000).

Mathkour, H. I., A. A. Touir and W. A. Al-Sanea. Parsing Arabic Texts Using Rhetorical Structure Theory, Journal of Computer Science, 4:713-720 (2008).

Papka, R., J. P. Callan and A. G. Barto. Text Based Information Retrieval Using Exponential Gradient Descent, Proc. NIPS, X, 3-9 (2008).

Petratos, P. Information Retrieval Systems: A Perspective on Human Computer Interaction, Issues in Informing Science and Information Technology, 3:511-518 (2006).

Shoaib, M. and A. A. Shah. A new indexing technique for IR systems using Rhetorical Structure Theory, Journal of Computer Science, 2006.

Taboada, M. and W. C. Mann. Applications of Rhetorical Structure Theory, Discourse Studies, 8:567-588 (2005).

Taboada, M. and W. C. Mann. Rhetorical Structure Theory: Looking Back and Moving Ahead, Discourse Studies, 8:423-459 (2006).
