Pakistan Journal of Science (Vol. 62 No. 3 September, 2010)

SOFTWARE QUALITY METRICS FOR RHETORICAL STRUCTURE THEORY BASED INFORMATION RETRIEVAL SYSTEMS

M. Shoaib, S. Arshad, S. Jabeen and M. P. Tariq*

Department of Computer Science & Engineering, Lahore, Pakistan; *Department of CS, Virtual University of Pakistan

Corresponding author: tariq_cp@hotmail.com
ABSTRACT: Several information retrieval models exist today, for example the Boolean model, the vector space model and the probabilistic model, but none of them provides exactly the information the user requires. Search engines based on such models emphasize the syntactic aspects of the query rather than capturing its semantics. Most work in the domain of information retrieval relies on keyword based indexing, and searching over such an index leads to irrelevant result sets and poor search-engine performance. Rhetorical structure theory has revolutionized information retrieval by introducing semantic based retrieval, improving the performance of search engines and reducing irrelevant results to a minimum. In this paper, we present and validate design metrics for rhetorical structure theory based information retrieval systems, so as to measure retrieval relevance and to improve retrieval performance.

Keywords: Keyword based indexing, dynamic weight, text based information retrieval, tokens, discourse structure
INTRODUCTION

In Text Based Information Retrieval (TBIR) systems, when the user enters a query 'q' against a collection of documents 'c', each document 'd' is examined and given a weight according to how well it satisfies the semantics of the query 'q', or in other words, how well it fulfils the user's requirement. For every instance of the triple (q, d, c), the weight attributed to a document 'd' is computed by evaluating a ranking function over that triple. Generally speaking, an Information Retrieval (IR) system (Arnt et al., 2004; Petratos, 2006) performs various retrieval tasks such as document indexing, ranking and classification. Document ranking is achieved by sorting all documents in the collection on the basis of the assigned weights (Papka et al., 2008). The core job of an IR system is the parsing process. The tokens produced by parsing documents and queries are called terms. The query provided to the IR system in natural language is parsed into a set of terms, and the terms derived from a document are used to build an inverted list, which serves as an index to the document collection. Normally, IR systems assume that if a term co-occurs in the query and a document then the document is relevant to the query; the co-occurrence of multiple query terms in a document further contributes to that document's degree of relevance. Current information retrieval techniques (Liu et al., 2009) are able to retrieve only 30% of the relevant information. Present information retrieval systems are based on keyword based indexing, in which keywords are assigned static weights using retrieval models such as the extended Boolean model, the vector model or the probabilistic model. Words carrying different meanings in different contexts can then cause the retrieval of irrelevant information (Shoaib and Shah, 2006).
Rhetorical Structure Theory (RST) (Mann and Thompson, 1986; Mann and Thompson, 1987; Mathkour et al., 2008; Taboada and Mann, 2006) is a descriptive linguistic approach for identifying textual relations, originally with the intention of generating text. The success of RST can be seen from the wide application of the theory in domains ranging from discourse analysis to psycholinguistics, theoretical linguistics and computational linguistics (Jun'chi and Tsujii, 1994; Mathkour et al., 2008; Taboada and Mann, 2005). The theory is notably used in applications of computational linguistics such as text parsing and generation, machine translation and essay scoring, and most importantly in natural language processing (Mann and Thompson, 1987; Marcu, 1997). RST organizes text by means of relations that exist between its sections and establishes a coherent structure in which every section of the text plays a role or function with respect to the other sections. The resulting relations are termed coherence relations, discourse relations, conjunction relations or rhetorical relations, and around 30 different relations may hold between the sections of a text (Taboada and Mann, 2005; Taboada and Mann, 2006). RST identifies two different types of text units: the nucleus and the satellite. Nuclei are of prime importance, whereas satellites contribute to the nucleus and are considered secondary. In the elaboration relation, for instance, the nucleus is the part of the text containing the basic information and the satellites are the parts containing additional or supporting information about the nucleus. RST relations are applied to the text recursively until all text units are associated with one of the RST relations (Taboada and Mann, 2006).
With the gigantic increase in the volume of web content, it has become a pressing need to improve the performance of search engines so that they can retrieve semantically relevant information in response to the user query. Currently, most search engines are based on keyword based searching techniques, which are considered static because they rely on syntax only. These IR systems are unable to handle the various contexts in which the same term appears in different documents. RST based IR systems cope with this problem by capturing the semantics of the terms using dynamic weight assignment techniques.
In this paper, we present various design metrics for RST based IR systems to measure retrieval relevancy and improve the performance of IR systems. We also validate the proposed metrics through a number of case studies. We base our work on the architecture of the RST based IR system presented by Shoaib and Shah (2006); the case studies used to validate the proposed design metrics are also taken from that work.

The rest of the paper is organized as follows: in section 2, we describe the architecture of the RST based IR system upon which we build the design metrics and their validation; in section 3, we present materials and methods; in section 4, we present results and discussion; and in section 5, we conclude our work and outline future directions.
Architecture of RST based IR system: Since we have selected the architecture of the RST based IR system proposed by Shoaib and Shah (2006) as the basis of our research, we describe this architecture in detail in this section. They proposed an indexing technique with dynamic weight assignment based on RST. They used punctuation marks and cue phrases (Marcu, 2000), i.e. words that connect two or more text spans, to identify the rhetorical relations, and constructed an RST tree whose leaves represent the text spans and whose internal nodes represent the rhetorical relations. Keywords extracted from the text spans are assigned dynamic weights such that keywords in text spans closer to a nucleus receive priority weights. With this dynamic weight assignment technique the precision rate is clearly improved, and the system developed on this basis is able to capture the semantics of a document efficiently. The architecture of their model consists of four modules: 1) Segmentor, 2) Rhetorical Relation Finder, 3) Rhetorical Tree Parser and 4) RST based Indexer (fig. 1).

Fig-1: Architecture of RST based IR systems proposed by Shoaib and Shah (2006)

The purpose and functionality of each component is defined as follows.
Segmentor: Text structure is usually organized into words, sentences and paragraphs, and there is a correlation between these different text spans which needs to be identified. Handling the structure in the text requires identifying the boundaries within it; this is the primary function of text segmentation. The text segmentor provides the structural information that is utilized for text analysis.
Rhetorical relation finder: In the next step, the indexer automatically finds the rhetorical relations. For this purpose, cue phrases and punctuation are used to obtain the relationship between the text spans.
RST tree: From the list of text spans and the list of relations generated in the previous stages, the rhetorical tree is generated; its internal nodes represent the relations and its leaves represent the text spans. Based on the height of the tree, an initial weight is assigned to each text span. The indexer utilizes the keywords and their corresponding initial/dynamic weights to populate its knowledge base.
Indexer: The indexer uses the concept of strong and weak nodes to assess the initial weight, and this initial weight assessment is used for dynamic weight assignment. The initial weight is taken between 0 and 1: the root node is assigned 1, whereas a nucleus and a satellite of a parent node are assigned 0.9 and 0.5 respectively. These weights are variable. The weight assigned to a child node is generalized by the following formula:

Initial weight of child node = Weight of child node * Weight of parent node

After calculating the initial weights of all child nodes, these weights are associated with the index terms occurring in the text spans. On the basis of the initial weight
assessment and the term frequency, the actual weight of each index term is calculated as follows:

Actual weight of the index term = Initial weight assessment * Term frequency

The indexer maintains the knowledge base by saving the document ID, vocabulary ID and actual weights. The document ID keeps track of which word belongs to which document. The vocabulary ID ensures that there is no redundancy of words in the knowledge base, thereby consuming less space. The dynamic weight indicates the semantics-based occurrence of important index terms.
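The weight rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the class and function names are our own, and only the stated rules are encoded (root = 1, nucleus child multiplies its parent's weight by 0.9, satellite by 0.5, actual weight = initial weight times term frequency).

```python
NUCLEUS_FACTOR = 0.9    # weight factor for a nucleus child (from the paper)
SATELLITE_FACTOR = 0.5  # weight factor for a satellite child (from the paper)

class RstNode:
    """A node of the RST tree; names here are hypothetical."""
    def __init__(self, role, children=None):
        self.role = role                  # "root", "nucleus" or "satellite"
        self.children = children or []
        self.initial_weight = 0.0

def assign_initial_weights(node, parent_weight=None):
    """Initial weight of child node = weight of child node * weight of parent node."""
    if parent_weight is None:             # the root node is assigned 1
        node.initial_weight = 1.0
    else:
        factor = NUCLEUS_FACTOR if node.role == "nucleus" else SATELLITE_FACTOR
        node.initial_weight = factor * parent_weight
    for child in node.children:
        assign_initial_weights(child, node.initial_weight)

def actual_weight(initial_weight, term_frequency):
    """Actual weight of an index term = initial weight assessment * term frequency."""
    return initial_weight * term_frequency

tree = RstNode("root", [RstNode("nucleus"), RstNode("satellite")])
assign_initial_weights(tree)
print(tree.children[0].initial_weight)          # nucleus: 0.9 * 1.0 = 0.9
print(round(actual_weight(0.9, 3), 2))          # term appearing 3 times: 2.7
```

A deeper tree simply multiplies the factors down each path, so spans far from the root nucleus receive progressively smaller initial weights.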
MATERIALS AND METHODS

In this section, we propose a number of metrics related to the following basic components of an RST based IR system:

1. Segmentor
2. Rhetorical relation finder
3. Rhetorical tree parser
4. RST based indexer
Number of segments metric: Number of Segments (NoS) is one of the proposed metrics for improving the performance of IR systems. It concerns the segmentor, one of the basic components of RST based IR systems. This metric calculates the total number of segments into which a text document can be divided, and is defined as the summation of the number of cue phrases and punctuation marks identified, since these two parameters are used to divide a text into segments. A segment is any elementary unit such as a keyword, a line of text or a paragraph itself.

No way has yet been defined to validate the correct number of segments in advance. The output of the first component of the architecture is the list of individual segments. This list is fed into the next component, the rhetorical relation finder, which works on the list of segments to identify the corresponding rhetorical relations. Computing the number of segments in advance gives an exact idea of how much memory will be required to store the segments; the number of segments also helps to judge the structure of the RST tree, and hence the corresponding rhetorical relations, in advance.

Mathematically, the No. of Segments metric is defined as follows:

NoS = Σ_{d=1}^{│D│} (NoC_d + NoP_d)

Where,
NoC = Number of Cue Phrases identified
NoP = Number of Punctuation Marks identified
and d ranges over the documents of the collection D.
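A minimal sketch of the NoS computation follows. The cue-phrase list here is a tiny illustrative sample of our own choosing, not the full inventory used in the cited work (Marcu, 2000), and the punctuation set is likewise an assumption.

```python
import re

# Hypothetical sample of cue phrases; the real system uses a much larger list.
CUE_PHRASES = ["because", "although", "however", "therefore", "for example"]

def number_of_segments(documents):
    """NoS = sum over documents of (cue phrases found + punctuation marks found)."""
    total = 0
    for text in documents:
        noc = sum(text.lower().count(cue) for cue in CUE_PHRASES)  # NoC
        nop = len(re.findall(r"[.,;:!?]", text))                   # NoP
        total += noc + nop
    return total

docs = ["It rained, because the front moved in.",
        "However, the match went ahead; the pitch was covered."]
print(number_of_segments(docs))  # → 7  (doc 0: 1+2, doc 1: 1+3)
```

Knowing this count before building the RST tree bounds both the memory needed for the segment list and the number of leaves the tree can have.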
Rhetorical relation distance metric: There are two ways to measure the distance between two text spans. The first is to calculate the physically linear distance between them, i.e. the number of consecutive sentences that occur between the two text spans. The other is to compute the semantic distance between the text spans, which can be determined by the rhetorical relation distance (RRD) metric. This metric gives the semantic distance between two text spans in terms of the number of internal nodes of a discourse graph, where the nodes of the graph represent the text spans. Computing the rhetorical distance helps in analyzing the strength of a rhetorical relation: the greater the value of the RRD metric, the weaker the rhetorical relation, and vice versa. Hence,

RRD = 0 means strongest relation (direct relation)
RRD = 1 means good relation (indirect relation)
RRD > 1 means weak relation (indirect relation)

A weaker rhetorical relation indicates an indirect relation between two nodes, each node representing a text span, whereas a rhetorical relation is considered stronger if it is a direct relation between two nodes. Rhetorical relation strength can help in the construction and improvement of the rhetorical structure tree.
Suppose there are two nodes N_i and N_j in the RST graph, and the level of any node is represented as L with indexes i and j denoting the specific level number. We assume that index L_j always indicates a deeper level than L_i. If index k represents the particular node number at a specific level, then mathematically the rhetorical relation distance between these two nodes is
Rhetorical Relation Distance (RRD) = L_j(N_j^k) - L_{i+1}(N_i^k)

For nodes at various depths whose parents are non-sibling nodes at the same level, the RRD is computed as:

L_x = x + (x - 1), where x represents the number of the level

L_1 = 1 + 0 = 1    case 1 [common parent]
L_2 = 2 + 1 = 3    case 2 [sibling parent]
L_3 = 3 + 2 = 5    case 3 [non-sibling parent]
L_4 = 4 + 3 = 7    as above
And so on…
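The two RRD formulas can be sketched as follows. This is a hedged reading of the definitions above, with function names of our own: `rrd_direct` applies RRD = L_j - (L_i + 1) for a direct chain of relations, and `rrd_same_level` applies L_x = x + (x - 1) for the exceptional cases where both nodes sit at level x under distinct parents.

```python
def rrd_direct(level_i, level_j):
    """RRD = L_j - (L_i + 1): semantic distance along a direct chain of relations."""
    assert level_j >= level_i, "L_j is assumed to be the deeper level"
    return level_j - (level_i + 1)

def rrd_same_level(x):
    """Exceptional cases: both nodes at level x, parents at the same level."""
    return x + (x - 1)

def strength(rrd):
    """Interpret the RRD value as relation strength, per the scale above."""
    if rrd == 0:
        return "strongest (direct relation)"
    if rrd == 1:
        return "good (indirect relation)"
    return "weak (indirect relation)"

print(rrd_direct(0, 2), strength(rrd_direct(0, 2)))  # 1 good (indirect relation)
print(rrd_direct(0, 1), strength(rrd_direct(0, 1)))  # 0 strongest (direct relation)
print(rrd_same_level(3))                             # non-sibling parents: 5
```

The two `rrd_direct` calls reproduce the 3C–3A and 3C–3D computations worked in the validation section, and `rrd_same_level` reproduces the case table.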
Metrics for indexer: This section defines some important measurements for the indexer component of the architecture.
Indexer size metric: There are various measures of indexer size. Some researchers take the size of the indexer in terms of the number of web pages crawled, while others measure it by how much computer storage is required to support the index (Brown, 1995; Henzinger et al., 1999). Our proposed indexer size metric adopts a different approach: we measure the size of the indexer in terms of the number of key terms taken from the documents of the corpus and stored in the indexer as index terms.

So the indexer size metric (S_idx) is defined as "the sum of the index terms taken from the individual documents of the document collection". Mathematically,

S_idx = Σ_{d=1}^{│D│} d(T_idx)

Where 'd' represents a particular document in the document corpus 'D', T_idx_j represents an index term taken from a particular document, j is the counter over the index terms of that document, and d(T_idx) denotes the number of such terms in document d.
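As a minimal sketch of S_idx, we can represent each document simply by its list of index terms (an assumption made for brevity; the actual indexer stores IDs and weights as described earlier) and sum the per-document counts:

```python
def indexer_size(corpus):
    """S_idx = sum over documents of the number of index terms taken from each."""
    return sum(len(index_terms) for index_terms in corpus)

# Hypothetical term lists sized to match the paper's validation (19 and 21 terms).
corpus = [["day", "lyric", "song"] * 6 + ["music"],   # 19 index terms
          ["day", "doom", "judgment"] * 7]            # 21 index terms
print(indexer_size(corpus))  # → 40
```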
Indexer term relevancy metric: The efficacy of the indexer mainly depends upon the degree of relevancy of the index terms. We define the index term relevancy (ITR) metric as "the degree of relevancy of an index term according to the various contexts in which the term is used in different documents".

The indexer ensures semantically relevant results by assigning initial weights to the nodes of the RST tree and then computing the dynamic weights of the index terms from the initial weights and the term frequencies. Mathematically,

ITR_i = TF_i * W_init_i

Where TF_i indicates the term frequency of an index term and W_init_i represents the initial weight of that index term.
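A short sketch of the per-term computation: the same term scores differently in different documents because both its frequency and the initial weight of the span holding it differ. The numeric values below are illustrative assumptions, not entries from the authors' knowledge base.

```python
def index_term_relevancy(term_frequency, initial_weight):
    """ITR_i = TF_i * W_init_i: the dynamic weight of an index term in one document."""
    return term_frequency * initial_weight

# The same term used in two different documents/contexts (hypothetical values):
w_doc0 = index_term_relevancy(term_frequency=3, initial_weight=0.9)   # near a nucleus
w_doc1 = index_term_relevancy(term_frequency=4, initial_weight=0.45)  # in a satellite
print(w_doc0 > w_doc1)  # → True: doc 0 provides the more relevant context
```

Note that a higher raw frequency (4 vs 3) does not win here: the context, via the initial weight, dominates, which is exactly what distinguishes this scheme from static keyword weighting.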
User query relevancy metric: The User Query Relevancy (UQR) metric is defined as "how relevant the results returned by the indexer are to a user's query".

The UQR is computed in terms of the major operation performed in the indexer searching process, namely finding the index terms with maximum dynamic weight, so we express this operation using the max() function. When the user enters a query in natural language form, the max() function is computed against each term of the query to find the most relevant index terms (the terms having maximum dynamic weights). Mathematically, for each term t_i of the query,

UQR(t_i) = max(W_dyn(t_i))

where the maximum is taken over the dynamic weights W_dyn of the matching index terms across the indexed documents.
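The max() search step can be sketched as below. The toy knowledge base is hypothetical in shape, though the two weights for "day" (5.39 and 2.24) are the values reported in table 2 of this paper.

```python
knowledge_base = {
    # index term -> {document_id: dynamic_weight}
    "day":    {0: 5.39, 1: 2.24},
    "lyrics": {0: 3.10},           # assumed value for illustration
}

def most_relevant_documents(query_terms):
    """For each query term, pick the document where its dynamic weight is maximum."""
    results = {}
    for term in query_terms:
        postings = knowledge_base.get(term, {})
        if postings:
            # max() over the postings selects the document with the peak dynamic weight
            results[term] = max(postings, key=postings.get)
    return results

print(most_relevant_documents(["lyrics", "day"]))  # → {'lyrics': 0, 'day': 0}
```

Both terms point at document 0, so that document would be returned first, matching the "lyrics day" walk-through in the validation section.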
Quality of indexer metric: Another important metric for measuring the effectiveness of the indexer is the Quality of Indexer (QI) metric. When a user enters a query for which there can be multiple answers to choose from, the most important issue for indexer effectiveness is whether it indexes the information that is most useful from the user's perspective. Another reason for considering indexer quality is that the size of indexers is growing at a slower rate than the amount of web content. Due to the storage and processing power limitations of the indexer, it becomes important to index the most useful and relevant documents. In addition, search engines should index useful documents in some ranked order, so that the user gets the most relevant information rather than too many results, most of which are irrelevant to the query. Hence, an indexer with high quality indexes is a growing need of users.
Q(I) in a broader perspective can be considered at two tiers:

a. The overall quality of the documents indexed
b. The average quality per indexed document

Suppose every document indexed by the search engine is given a weight w(d). For convenience, we assume the weights are scaled so that the sum of all weights is 1. If I refers to the indexer and D to the set of all documents indexed by a search engine, then we define the quality of the indexer Q(I) as

Q(I) = Σ_{d ∈ D} w(d)

Note that since w is scaled, we always have 0 ≤ Q(I) ≤ 1. Letting │D│ denote the number of documents indexed by I, the average quality per indexed document is:

Q_avg(I) = Q(I) / │D│

This average quality per indexed document is a very good measure of how well an indexer selects documents for indexing.
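The two tiers of the metric reduce to a sum and a mean over the scaled weights, as this sketch shows; the weights below are hypothetical examples, not measured values.

```python
def indexer_quality(weights):
    """Q(I) = sum of the scaled weights w(d) of the indexed documents."""
    q = sum(weights)
    assert 0.0 <= q <= 1.0, "scaled weights must keep Q(I) within [0, 1]"
    return q

def average_quality(weights):
    """Q_avg(I) = Q(I) / |D|, the average quality per indexed document."""
    return indexer_quality(weights) / len(weights)

w = [0.2, 0.1, 0.3, 0.15]               # hypothetical scaled weights, |D| = 4
print(round(indexer_quality(w), 3))     # → 0.75
print(round(average_quality(w), 4))     # → 0.1875
```

Because the weights are scaled over everything the engine indexes, a Q(I) near 1 means the indexed set captures most of the available document value.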
RESULTS AND DISCUSSIONS

In this section we validate the various metrics defined for the architecture of RST based IR systems.
Validation of NoS metric: We validate the Number of Segments (NoS) metric for the following documents.

Table-1. Documents taken as sample text, as used by Shoaib and Shah (2006).

We take two documents (Doc 0 and Doc 1) as shown in table 1.

For Document 0: NoC = 1, NoP = 3
For Document 1: NoC = 8, NoP = 4

The total number of segments is then computed as follows:

NoS = [(1 + 3) + (8 + 4)] = 4 + 12 = 16

The same number of segments is reported by Shoaib and Shah (2006), which is in accordance with our NoS metric.
Validation of RRD metric: Taking as a case study an example from the work presented by Bosma (2005), the discourse structure shown in figure 2 is converted to the RST graph in figure 3.

The nodes of the RST graph in figure 3 represent the text spans in terms of nuclei and satellites. The arrows represent the relation between a nucleus and a satellite, with the head pointing from the nucleus to the satellite. From this RST graph, the RRD between nodes 3C and 3A is calculated as

RRD between 3C and 3A = L_j(N_j^k) - L_{i+1}(N_i^k) = 2 - (0 + 1) = 1

Hence, the number of internal nodes between 3C and 3A is 1, which shows that the rhetorical relation between 3C and 3A is good.

Whereas,

RRD between 3C and 3D = L_j(N_j^k) - L_{i+1}(N_i^k) = 1 - (0 + 1) = 0

Hence, the rhetorical relation between 3C and 3D is direct, or strong.
Exceptional cases

Case No. 1: If two nodes at the same level have a common parent node, then the RRD will always be equal to 1. So, for nodes 3D and 3B in figure 4, RRD between 3B and 3D = 1.

Fig-4: RST Graph 2

Fig-2: Discourse structure taken from Bosma (2005).

Case No. 2: The second exceptional case states that the RRD between two nodes 3F and 3A (figure 4), having sibling nodes as parents, will always be equal to 3. Hence, RRD between 3F and 3A = 3.

Fig-3: RST Graph 1

Fig-5: RST Graph 3
Table-2. Indexed terms in the knowledge base used by Shoaib and Shah (2006)
Fig-6: Top ten results returned by the Google search engine
Case No. 3: The third exceptional case states that the RRD between two nodes 3H and 3E (figure 5), existing at level 3 and having non-sibling nodes as parents, will be

RRD between 3H and 3E: L_x = x + (x - 1), so L_3 = 3 + 2 = 5

Hence, the number of internal nodes between 3H and 3E is 5, which indicates that the relation between nodes 3H and 3E is indirect and weak.
Validation of S_idx metric: Suppose the document corpus consists of the two documents shown in table 1. The size of the indexer is calculated as follows. We have │D│ = 2, and

For Doc 0: d0(T_idx) = 19
For Doc 1: d1(T_idx) = 21

S_idx = 0 + [19 + 21] = 40

Hence, the size of the indexer is 40 index terms, which is in accordance with the indexer proposed by Shoaib and Shah (2006).
Validation of ITR metric: To validate the ITR metric, we take the case study shown in table 2. Dynamic weights vary according to the context in which an index term is used in different documents. For example, suppose the user provides a query to search for information about "lyrics day". In the indexer, as shown in table 2, the index term "day" is assigned the dynamic weight 5.39 in document 0 but the dynamic weight 2.24 in document 1, according to the context in which it is used. The higher the value of the dynamic weight, the greater the relevancy; therefore, document 0 is returned in response to the user query.
Validation of QI metric: The indexer quality metric is validated by the following case study. Suppose the user queries the Google search engine for the keyword "dooms day" with the intention of getting information about the Day of Judgment. The result of the Google search engine in response to this query is shown in figure 6.

The search engine retrieved 5,520,000 results against the user query. For the case study we take only the top ten results, shown in ranking order. After analyzing the contents of the documents, the following scaled weights are assigned to the top ten documents:

d1 = 0.10, d2 = 0.075, d3 = 0.050
d4 = 0.20, d5 = 0.025, d6 = 0.075
d7 = 0.30, d8 = 0.025, d9 = 0.050
d10 = 0.075

Then the quality of the indexer is computed as:

Q(I) = 0.10 + 0.075 + 0.050 + 0.20 + 0.025 + 0.075 + 0.30 + 0.025 + 0.050 + 0.075 = 0.975

which satisfies 0 ≤ Q(I) ≤ 1. A value of Q(I) closer to 1 means the indexer has indexed valuable and relevant documents.

Proceeding further, we get the average quality per indexed document as

Q_avg(I) = Q(I) / │D│ = 0.975 / 10 = 0.0975
Conclusion and Future Work: We have proposed a number of metrics for RST based IR systems. The Number of Segments (NoS) metric shows that the correct number of segments can be validated in advance, leading to better decisions on storage capacity and a better understanding of the rhetorical relations and the RST tree construction. The next metric concerns the computation of the semantic distance between any two nodes of the RST graph. We then propose a number of metrics related to the most important component of the RST based IR architecture, the indexer. We propose the indexer size metric, based on the number of key terms indexed, and validate it to show that the size of the indexer is directly related to its efficiency. The Indexer Term Relevancy metric is based on the value of the maximum dynamic weight; the higher the degree of term relevancy, the better the efficiency of the indexer. Extending the same concept, we propose and validate the user query relevancy metric, which computes the efficacy of the indexer in returning the most relevant results in response to the user query. Finally, we propose the quality of indexer metric, based on the weights assigned to the various documents indexed.

A challenging issue regarding the enhancement of this research is to automate the computation of the metrics, so that when the user enters a query these metrics are automatically computed to show the performance of the search engine. Another issue which can be addressed in future is a comparative study of RST based IR systems, comparing their performance with that of other information retrieval systems using the metrics defined here. An automated model can be designed for this comparative analysis; such a model would have an interface like a search engine to accept the user query and return comparative performance statistics of the RST based IR systems.
REFERENCES

Arnt, A., S. Zilberstein, J. Allan and A. Mouaddib. Dynamic Composition of Information Retrieval Techniques, Journal of Intelligent Information Systems, 23:67-97 (2004).

Bosma, W. Query-Based Summarization using Rhetorical Structure Theory, 15th Meeting of CLIN, 29-44 (2005).

Brown, E. W. Execution Performance Issues in Full-Text Information Retrieval, Technical Report 95-81 (1995).

Henzinger, M. R., A. Heydon, M. Mitzenmacher and M. Najork. Measuring Index Quality Using Random Walks on the Web, Comput. Netw., 31:1291-1303 (1999).

Jun'chi, F. and J. Tsujii. Breaking down rhetorical relations for the purpose of analyzing discourse structures, 15th International Conference on Computational Linguistics, Kyoto, 1177-1183 (1994).

Liu, P., Z. Zhu and L. Zhao. Research on Information Retrieval System Based on Ant Clustering Algorithm, Journal of Software, 4:1032-1036 (2009).

Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: description and construction of text structures, Information Sciences Institute, Nijmegen, The Netherlands, ISI/RS-86-174, 1-15 (1986).

Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: A Framework for the Analysis of Texts, IPRA Papers in Pragmatics, 1:79-105 (1987).

Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: description and construction of text structures, in Natural Language Generation: Recent Advances in Artificial Intelligence, Psychology, and Linguistics (G. Kempen, ed.), Boston/Dordrecht: Kluwer Academic Publishers, 85-96 (1987).

Marcu, D. The Rhetorical Parsing of Natural Language Texts, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL'97/EACL'97), Madrid, Spain, 96-103 (1997).

Marcu, D. The Theory and Practice of Discourse Parsing and Summarization, 1st Edn., The MIT Press, Cambridge, MA, ISBN-10: 0262133725, 268 (2000).

Mathkour, H. I., A. A. Touir and W. A. Al-Sanea. Parsing Arabic Texts Using Rhetorical Structure Theory, Journal of Computer Science, 4:713-720 (2008).

Papka, R., J. P. Callan and A. G. Barto. Text-Based Information Retrieval Using Exponential Gradient Descent, Proc. NIPS, X, 3-9 (2008).

Petratos, P. Information Retrieval Systems: A Perspective on Human Computer Interaction, Issues in Informing Science and Information Technology, 3:511-518 (2006).

Shoaib, M. and A. A. Shah. A new indexing technique for IR systems using Rhetorical Structure Theory, Journal of Computer Science (2006).

Taboada, M. and W. C. Mann. Applications of Rhetorical Structure Theory, Discourse Studies, 8:567-588 (2005).

Taboada, M. and W. C. Mann. Rhetorical Structure Theory: Looking Back and Moving Ahead, Discourse Studies, 8:423-459 (2006).