YEARS OF EUROPEAN ONLINE ANNÉES DE EN LIGNE ...


estig.ipbeja.pt
21.11.2013

development of user interfaces, which by means of taxonomies lead the user to the searched domain of interest (Dörre, Gerstl and Seiffert, 2004, p. 480). One of the biggest challenges of information retrieval systems is the fact that most (if not all) of the documents, in the broader sense of the word, are written in a natural language (1). Among other things, these communication systems are characterised by a number of different ways of referring to the same extralinguistic facts. Sometimes it is even good practice in a conversation to reformulate a statement in order to show that the original has been understood and/or to obtain confirmation of it. In other cases, certain facts are paraphrased without a specific keyword being used. For example, if an author describes a nice lady in white clothes sitting on a glass bowl, it could mean that he is talking about luck. This example of a medieval allegory, which today is probably only understood by some experts, illustrates another characteristic of human language: it changes over the course of time. New expressions appear and others disappear or change their meaning. It also has to be kept in mind that the items of our vocabulary are not only related to the concept of an extralinguistic fact, but are related to each other as well. This is why, for instance, good dictionaries give hints about synonyms or antonyms of a given expression. But in the context of environmental protection, for example, it could also be of interest to retrieve information on air pollution. This case indicates the complexity of the relations between expressions, and it may be doubted whether such word fields are ever complete.

Such circumstances obviously make the automatic retrieval of relevant information extremely complex. Documents therefore need to be analysed by means of a scientific linguistic methodology which goes far beyond the still widespread approach of simple text indexing and clearing.
The still 'young' domain of text mining tries to develop appropriate methods to support the digging for information within natural language documents.

Text mining is sometimes also referred to as 'text data mining' or 'knowledge discovery in text'. In general it defines the process of retrieving information in texts. This marks the most important difference between data mining and text mining: while data mining procedures try to extract relevant information from structured databases, text mining concentrates on unstructured text documents. The tasks and objectives for the analysis process are more or less the same (Dörre, Gerstl and Seiffert, 2004, p. 480).

(1) Some attempts to translate documents into a more formal language (Interlingua) were not very successful. See the comments by Hutchins (1986, Chapter 10).

Information is typically identified through processes discovering patterns and relations, mainly by means of statistical pattern learning. Texts are generally regarded as unstructured data, in contrast to database information, which is supposed to be structured. Text mining usually involves the process of structuring the input text by 'parsing', which is completed by the addition and/or removal of linguistic features. This restructuring of data permits the derivation of patterns as well as the evaluation and interpretation of the output. The quality of text mining is usually judged on the combination of relevance, novelty and tractability. Typical text-mining tasks include text classification, text clustering, concept or entity extraction, document summarisation and modelling of entity relations.

Text-mining processes may be described as a sequential flow of activities. By means of statistical algorithms, the key terms of a textual entity are identified. Comparison with entries in ontologies offers possibilities to group those texts together with similar ones. In this way, a basis of knowledge is created and extended after analysing other documents.

An example will show the complexity of the necessary methods. Imagine that a document contains the German word Birne 'pear'. It has to be taken into account that the use of this term could be an ellipsis or a metaphor.
That leads us to the following virtual classes, which are distinguished from each other by the different meanings of the key term: (1) a kind of fruit; (2) the tree which produces the fruit ('pear tree'), an elliptic use for Birnenbaum; (3) the wood of a pear tree, which is used for the construction of furniture, an ellipsis for Birnenholz; (4) an electric bulb, which in many cases has a form resembling a pear, a metaphor well established in the German vocabulary and at the same time an ellipsis for Glühbirne; (5) ironically, the head of a human being, which in certain stylistic contexts may be compared with a pear and could in that case be regarded as a metaphor. Although the last of these variants only has to be taken into account depending on the stylistic context, the other ones need deeper analysis so that the documents concerned can be related to similar ones. In the first case, this could consist of references to other types of fruit or foods. If the document
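The idea of relating a document to one of these virtual classes by comparing its terms with ontology entries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy "ontology" `SENSE_TERMS`, its German term lists and the function names are hypothetical, and the plain overlap count stands in for the statistical methods mentioned in the text rather than reproducing any particular system. The ironic fifth class is left out because, as noted, it depends on stylistic context.

```python
import re
from collections import Counter

# Hypothetical toy "ontology": each virtual class of "Birne" is linked to a
# few related German terms. The term lists are illustrative assumptions only.
SENSE_TERMS = {
    "fruit": {"apfel", "obst", "essen", "reif", "süß", "kuchen"},
    "tree": {"garten", "blüte", "pflanzen", "ast", "obstbaum"},
    "wood": {"möbel", "holz", "tisch", "furnier", "schreiner"},
    "light bulb": {"lampe", "watt", "strom", "licht", "fassung"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score_senses(text):
    """Count, per sense, how many tokens of the text occur in its term set."""
    tokens = Counter(tokenize(text))
    return {
        sense: sum(count for tok, count in tokens.items() if tok in terms)
        for sense, terms in SENSE_TERMS.items()
    }


def best_sense(text):
    """Return the highest-scoring sense, or None if no related term occurs."""
    scores = score_senses(text)
    sense, score = max(scores.items(), key=lambda item: item[1])
    return sense if score > 0 else None


print(best_sense("Die Birne in der Lampe hat 60 Watt."))  # light bulb
```

A document that mentions Birne without any of the listed context terms yields `None`, which reflects the point made above: without further analysis of the surrounding text, the class cannot be decided.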

