YEARS OF EUROPEAN ONLINE ANNÉES DE EN LIGNE ...


tual centre for text mining in biomedicine)). It offers the automatic analysis of publications of any kind in biology and medicine. Furthermore, it helps to classify the electronically available, but mostly unstructured, information on patients. Thus, relations which until now had been overlooked can also be taken into account.

In the United States of America, the National Library of Medicine has used text mining for 15 years with great success. The British government has supported the UK National Centre for Text Mining with a GBP 1 million grant. The Japanese national parliament recently decided to establish a centre for text mining in biology. The list could go on, and shows that the technology is considered to be well advanced. Drew Robb (2004) gives an impressive list of projects worldwide which handle tremendous amounts of data.

4. TEXT MINING IN COMPARISON WITH OTHER INFORMATION-RETRIEVAL METHODOLOGIES

Information retrieval did not become a discipline only with the widespread use of computers, particularly personal computers. Some of its methodologies are in fact older than computers. Some of these methodologies are described here, together with an analysis of the advantages that text-mining technologies offer.

Boolean retrieval

Boolean retrieval is widely used, mostly because of its simple syntax. Terms are searched for and may be combined with the operators AND (∧), OR (∨) and NOT (¬). Although the priorities of combinations can be modified, which may result in rather complex requests, it is not too difficult to validate the syntax of a command. However, it is impossible to control the semantics of a request; for instance, it is not possible to detect terms which exclude each other. In many applications based on this technology, the syntax is extended by additional comparison operators such as > (greater than), < (less than), = (equal), ≥ (greater or equal), ≤ (less or equal) and ≠ (not equal).
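As a minimal sketch of how the three Boolean operators behave, the following Python fragment maps AND, OR and NOT onto set intersection, union and complement. The document collection, the helper names (`has`, `and_`, `or_`, `not_`) and the whitespace tokenisation are illustrative assumptions, not part of any system described in the text.

```python
# Hypothetical toy collection: document id -> text (not taken from the source).
DOCS = {
    1: "text mining in biology and medicine",
    2: "boolean retrieval in legal databases",
    3: "text retrieval with controlled vocabulary",
}

# Each document is reduced to the set of lower-case terms it contains.
DOC_TERMS = {doc_id: set(text.lower().split()) for doc_id, text in DOCS.items()}
ALL_IDS = set(DOCS)


def has(term):
    """Return the ids of all documents that contain the term."""
    return {doc_id for doc_id, words in DOC_TERMS.items() if term in words}


def and_(a, b):
    """AND: documents present in both result sets."""
    return a & b


def or_(a, b):
    """OR: documents present in at least one result set."""
    return a | b


def not_(a):
    """NOT: all documents of the collection that are not in the result set."""
    return ALL_IDS - a


both = and_(has("text"), has("retrieval"))             # {3}
either = or_(has("text"), has("boolean"))              # {1, 2, 3}
without = and_(has("retrieval"), not_(has("text")))    # {2}

# Syntactically valid, but the two conditions exclude each other:
# the request can never match anything.
contradiction = and_(has("text"), not_(has("text")))   # set()
```

The last request illustrates the point about semantics: its syntax is easy to validate, yet nothing in the evaluation signals that the two conditions exclude each other; the user simply receives an empty result.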
Terms may be related to fields of structured data in a database or refer to words or expressions within unstructured data. The efficiency of Boolean retrieval is often improved by collecting keywords in so-called inverted lists. Each expression is accompanied by references to the documents from which it was extracted. The problem is, however, that these lists take up a lot of storage
space and their maintenance may be very time-consuming, because in many cases the inverted indexes have to be regenerated. This is why the expressions of an inverted list are reduced to those terms which have been collected in a controlled vocabulary.

Boolean retrieval has some weaknesses. In particular, the quality of a retrieval result is very difficult to control. Furthermore, in some cases the retrieval is considered to be too strict. A request such as A ∧ B ∧ C delivers results only if the data contain all three expressions; the result is empty if only two conditions are fulfilled.

Vector model

In general, vector model-based systems lead to the best results in retrieval processes. Instead of focusing on single index items from a microscopic point of view, the vector model starts from a macroscopic perspective. It supposes that documents are characterised by the statistical distribution of their terms. When a retrieval is started, a vector model-based system tries to identify those documents which suit the query best with regard to the statistics of the terms concerned. To do so, formulae have been developed which help to calculate the similarity between documents.

The calculation of distances between text documents is based on high-dimensional vectors of features. The features identified in all documents of a given collection are extracted. Together, these features form a feature space. On the basis of predefined selection criteria, the number of these features is reduced. The elimination of so-called stop words is one of a number of techniques applied. Another determines, by means of statistical analysis, terms of particularly high or low frequency. At the end of this phase, which is generally called feature reduction, the feature space has n dimensions. It is now possible to describe each text of the collection by a vector. The value of an element in each dimension, v_j, is defined by the feature matrix which corresponds to the document.
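The steps from feature extraction to document vectors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three example texts, the stop-word list, the document-frequency thresholds `min_df` and `max_df`, and the use of raw term counts as vector values are all hypothetical choices, not the method of any particular system.

```python
from collections import Counter

# Hypothetical inputs: a tiny collection and a stop-word list (not from the source).
DOCS = [
    "the court applies the regulation",
    "the regulation amends the directive",
    "the court cites the directive and the regulation",
]
STOP_WORDS = {"the", "and"}


def tokenize(text):
    """Split a text into lower-case terms (a deliberately naive tokeniser)."""
    return text.lower().split()


def build_feature_space(docs, stop_words, min_df, max_df):
    """Return the features that survive feature reduction, in sorted order.

    A term is kept if it is not a stop word and the number of documents
    containing it lies between min_df and max_df (inclusive).
    """
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(tokenize(text)))  # count each term once per document
    return sorted(
        term
        for term, n in doc_freq.items()
        if term not in stop_words and min_df <= n <= max_df
    )


def to_vector(text, features):
    """Describe one text as a vector with one count per feature."""
    counts = Counter(tokenize(text))
    return [counts[feature] for feature in features]


# Drop stop words, terms found in only one document, and terms found in all three.
FEATURES = build_feature_space(DOCS, STOP_WORDS, min_df=2, max_df=2)
VECTORS = [to_vector(text, FEATURES) for text in DOCS]
# FEATURES == ['court', 'directive'], so n = 2
# VECTORS  == [[1, 0], [0, 1], [1, 1]]
```

In this toy collection "regulation" occurs in every document and therefore does not help to tell the documents apart, while "applies", "amends" and "cites" occur only once; both kinds of term are removed, leaving a two-dimensional feature space.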
The concrete value of v_j depends on the applied methodology. Sometimes it is sufficient to signal the absence or presence of a feature by the figures 0 and 1. In other cases, the absolute or normalised frequency is preferred. Normalisation is generally necessary to compensate for the varying length of documents.

The distance between two texts can now be determined on the basis of the corresponding vectors. Simple measures are based on the distance between the points defined by the vectors in an n-dimensional space, or on the angle formed by the vectors. In general, the values are normalised to results which are placed between 0 and 1, for
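A short Python sketch can make the two kinds of measure concrete. The text does not name specific formulae, so the choices here are assumptions: Euclidean distance stands in for the distance between points, and the cosine of the angle stands in for the angle-based measure. The three count vectors are invented for illustration.

```python
import math


def binary(counts):
    """Signal only the absence (0) or presence (1) of each feature."""
    return [1 if c else 0 for c in counts]


def normalise(counts):
    """Relative frequencies, which compensate for differing document length."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)


def euclidean_distance(u, v):
    """Distance between the points defined by two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors.

    For vectors without negative components the result lies between 0 and 1.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical count vectors over the same three features.
short_doc = [2, 1, 0]
long_doc = [4, 2, 0]   # same proportions as short_doc, but twice as long
other_doc = [0, 0, 3]  # shares no feature with the other two

raw_distance = euclidean_distance(short_doc, long_doc)
normalised_distance = euclidean_distance(normalise(short_doc), normalise(long_doc))
similar = cosine_similarity(short_doc, long_doc)
unrelated = cosine_similarity(short_doc, other_doc)
```

On raw counts the short and the long document lie some distance apart although their term proportions are identical; after normalisation the distance vanishes. The cosine measure ignores vector length altogether, giving a value close to 1 for the first pair and 0 for the pair that shares no feature.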
