YEARS OF EUROPEAN ONLINE ANNÉES DE EN LIGNE ...

allows for identifying the concepts of a document. In fact, both approaches
should be used in succession to arrive at a clear and solid classification of texts.
The documents are in fact instantiations of different concepts.
Text-mining technology is not limited to the basic ideas expressed
at the surface of a given text, but discovers other ones as well. Thus
documents can be related to each other even though the key terms used are different.
For example, a document dealing with the legislation on ‘environment protection’
can be classified together with documents treating problems of ‘air pollution’.

Linguistic statistics offer basic methods for distinguishing relevant from
less relevant or non-relevant components. Regularities in the frequency of
words and word forms in a given language allow linguistic relations to be
described. This approach, combined with a probabilistic linguistic model, is one
of the foundations of text mining.

One of the most important methods compares the vocabulary of a given text
with that of a reference text collection which is supposed to
represent the general vocabulary of a language. After determining the frequencies
of all word forms or tokens in both the text and the reference collection,
each word can be assigned to one of four classes:

(1) words which exist in the analysed text, but are not part of the reference
collection: there is a very high probability that these words belong to the
specific vocabulary of the domain which the text deals with;
(2) words which exist both in the analysed text and the reference collection,
but with a higher relative frequency in the analysed text than in
the general collection: if a predefined threshold value is exceeded, it is
probable that these words also belong to the specific terminology of the
domain which the text is about;
(3) words which exist both in the analysed text and the reference collection
and whose relative frequencies are more or less the same: in general these
words are necessary for the functioning of a language and do not relate to
specific subjects;
(4) words which exist in the reference collection with higher relative frequencies
than in the analysed text: with a very high probability these words are
not terms of the subject matter treated by the analysed document.
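
The four classes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular text-mining product: the function names, the pre-tokenised input, and the `threshold` of 2.0 applied to the ratio of the two relative frequencies are all hypothetical choices, and the text itself does not say how "more or less the same" should be measured.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each token's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def classify_tokens(text_tokens, reference_tokens, threshold=2.0):
    """Assign every token of the analysed text to one of the four classes.

    `threshold` is an assumed cut-off on the ratio
    (relative frequency in text) / (relative frequency in reference).
    Tokens that occur only in the reference collection are not reported.
    """
    text_freq = relative_frequencies(text_tokens)
    ref_freq = relative_frequencies(reference_tokens)
    classes = {}
    for token, freq in text_freq.items():
        if token not in ref_freq:
            classes[token] = 1      # absent from the reference collection
            continue
        ratio = freq / ref_freq[token]
        if ratio >= threshold:
            classes[token] = 2      # clearly over-represented in the text
        elif ratio <= 1 / threshold:
            classes[token] = 4      # clearly under-represented in the text
        else:
            classes[token] = 3      # roughly the same relative frequency
    return classes


# Toy data, invented for illustration only.
text_tokens = "ozone pollution pollution the of".split()
reference_tokens = ["the"] * 4 + ["of"] * 10 + ["pollution"] + ["a"] * 5
print(classify_tokens(text_tokens, reference_tokens))
```

With this toy data, 'ozone' falls into class 1, 'pollution' into class 2, 'the' into class 3 and 'of' into class 4. In practice the reference collection would be a large general-language corpus, and the threshold would have to be tuned empirically.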

The advantage of this method is that the key terms of a given text can be
identified with a relatively high probability. The next step could consist of
classifying the document together with documents which deal with similar sub-

