YEARS OF EUROPEAN ONLINE ANNÉES DE EN LIGNE ...
development of user interfaces which, by means of taxonomies, lead the user to the domain of interest being searched for (Dörre, Gerstl and Seiffert, 2004, p. 480).

One of the biggest challenges for information retrieval systems is the fact that most (if not all) documents, in the broader sense of the word, are written in a natural language (1). Among other things, natural languages are characterised by offering a number of different ways of referring to the same extralinguistic facts. Sometimes it is even good practice in a communication situation to reformulate a statement in order to show that the original has been understood and/or to obtain confirmation of it. In other cases, certain facts are paraphrased without a specific keyword being used. For example, if an author describes a nice lady in white clothes sitting on a glass bowl, it could mean that he is talking about luck. This example of a medieval allegory, which today is probably only understood by some experts, demonstrates another characteristic of human language: it changes over the course of time. New expressions appear and others disappear or change their meaning.

It must also be kept in mind that the items of our vocabulary not only stand in a certain relation to the concept of an extralinguistic fact, but are related to each other as well. This is why, for instance, good dictionaries give hints about synonyms or antonyms of a given expression. But in the context of environmental protection, for example, it could also be of interest to retrieve information on air pollution. This case indicates the complexity of the relations between expressions, and it may be doubted whether such word fields are ever complete.

Such circumstances obviously make high-performing automatic information retrieval extremely complex. Documents therefore need to be analysed by means of scientific linguistic methodology which goes far beyond the still widespread approach of simple text indexing and clearing.
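The point about related expressions can be illustrated with a minimal sketch of query expansion. It assumes a small hand-made table of related terms; the table, the function names and the sample documents are hypothetical and only serve to show why a plain keyword match would miss a document on air pollution when the user asks about environment protection.

```python
# Hypothetical, hand-made table of related terms. A real system would draw
# on a maintained thesaurus or ontology instead.
RELATED_TERMS = {
    "environment protection": {"environmental protection", "air pollution"},
    "luck": {"fortune"},
}


def expand_query(query):
    """Return the normalised query plus every term listed as related to it."""
    key = query.lower().strip()
    return {key} | RELATED_TERMS.get(key, set())


def search(query, documents):
    """Return the sorted ids of documents that contain any expanded term."""
    terms = expand_query(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    )


# Illustrative sample documents (invented for this sketch).
DOCUMENTS = {
    "doc1": "New limits on air pollution from road traffic.",
    "doc2": "Rules on the labelling of fruit juices.",
}

print(search("environment protection", DOCUMENTS))
```

Here the first document is found only because the related term matches; the query string itself does not occur in it. As the text notes, such a table can hardly ever be complete, so this approach is a partial aid rather than a solution.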
The still ‘young’ domain of text mining tries to develop appropriate methods to support the digging for information within natural language documents.

Text mining is sometimes also referred to as ‘text data mining’ or ‘knowledge discovery in text’. In general, the term denotes the process of retrieving information from texts. This is the most important difference between data mining and text mining: while data mining procedures try to extract relevant information from structured databases, text mining concentrates on unstructured text documents. The tasks and objectives of the analysis process are more or less the same (Dörre, Gerstl and Seiffert, 2004, p. 480).

(1) Some attempts to translate documents into a more formal language (Interlingua) were not very successful. See the comments by Hutchins (1986, Chapter 10).

Information is typically identified through processes that discover patterns and relations, mainly by means of statistical pattern learning. Texts are generally regarded as unstructured data, in contrast to database information, which is supposed to be structured. Text mining usually involves structuring the input text by ‘parsing’, complemented by the addition and/or removal of linguistic features. This restructuring of the data permits the derivation of patterns as well as the evaluation and interpretation of the output. The quality of text mining is usually judged on a combination of relevance, novelty and tractability. Typical text-mining tasks include text classification, text clustering, concept or entity extraction, document summarisation and modelling of entity relations.

Text-mining processes may be described as a sequential flow of activities. By means of statistical algorithms, the key terms of a textual entity are identified. Comparison with entries in ontologies makes it possible to group the text together with similar ones. In this way, a knowledge base is created, which is extended as further documents are analysed.

An example will show the complexity of the necessary methods. Imagine that a document contains the German word Birne ‘pear’. It has to be taken into account that this term could be used as an ellipsis or a metaphor.
That leads us to the following virtual classes, which are distinguished from each other by the different meanings of the key term:

(1) a kind of fruit;
(2) the tree which produces the fruit (‘pear tree’); this is an elliptic use for Birnenbaum;
(3) the wood of a pear tree, which is used for making furniture; this is an ellipsis for Birnenholz;
(4) an electric light bulb, which in many cases has a shape resembling a pear; this is a metaphor well established in the German vocabulary and at the same time an ellipsis for Glühbirne;
(5) ironically, the head of a human being, which in certain stylistic contexts may be compared to a pear; in that case it could be regarded as a metaphor.

Although the last of these variants only has to be taken into account depending on the stylistic context, the others need deeper analysis so that the documents concerned can be related to similar ones. In the first case, this could consist of references to other types of fruit or foods. If the document
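The assignment of a document to one of the five classes listed above can be sketched as a simple overlap count between the words of the document and a set of cue words per class. This is only a minimal illustration under stated assumptions: the cue-word sets, the class labels and the function names are invented for this sketch and do not come from a real ontology, and a production system would rely on far richer linguistic analysis.

```python
import re

# One entry per reading of "Birne" listed above. The German cue words are
# illustrative assumptions, not entries taken from a real ontology.
SENSE_CUES = {
    "fruit": {"apfel", "obst", "kuchen", "essen", "reif"},
    "tree": {"garten", "pflanzen", "blüte", "baum"},
    "wood": {"möbel", "tisch", "furnier", "holz"},
    "light bulb": {"lampe", "watt", "strom", "fassung"},
    "head": {"kopf", "hut", "schlag"},
}


def tokenize(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def classify_sense(text):
    """Return the class whose cue words overlap most with the text.

    Returns None when no cue word occurs at all. On a tie, the class
    listed first in SENSE_CUES wins.
    """
    tokens = tokenize(text)
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_sense("Die Birne in der Lampe hat 40 Watt."))
print(classify_sense("Apfel und Birne sind reif."))
```

The first sentence shares two cue words with the light bulb class and the second shares two with the fruit class, which matches the observation above that references to other types of fruit point to the first reading. The ironic fifth reading depends on style rather than vocabulary, so a word-overlap count of this kind is unlikely to capture it reliably.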