21.11.2013 Views

YEARS OF EUROPEAN ONLINE ANNÉES DE EN LIGNE ...




WORKSHOP

space and their maintenance may be very time-consuming, because in many
cases the inverted indexes have to be regenerated. This is why the expressions
of an inverted list are reduced to those terms which have been collected in a
controlled vocabulary.

Boolean retrieval has some weaknesses. In particular, the quality of a retrieval
result is very difficult to control. Furthermore, in some cases the retrieval
is considered to be too strict. A request such as A ∧ B ∧ C delivers results
only if the data contains all three expressions; the result is empty if only
two conditions are fulfilled.
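The strictness of a conjunctive request can be illustrated with a minimal sketch. The toy inverted index, the document ids and the helper name `boolean_and` are assumptions made for this example only; they do not come from any particular retrieval system.

```python
# Toy inverted index: term -> set of ids of the documents containing it.
# Every document here contains exactly two of the three terms.
inverted_index = {
    "A": {1, 3},
    "B": {1, 2},
    "C": {2, 3},
}


def boolean_and(index, terms):
    """Return the ids of the documents that contain every term in `terms`."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    # A conjunction is the intersection of the posting sets.
    return set.intersection(*postings)


all_three = boolean_and(inverted_index, ["A", "B", "C"])
two_terms = boolean_and(inverted_index, ["A", "B"])
```

Although each document fulfils two of the three conditions, `all_three` is empty; only the weaker request A ∧ B returns a document.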

Vector model

In general, vector model-based systems lead to the best results in retrieval
processes. Instead of focusing on single index items from a microscopic point
of view, the vector model starts from a macroscopic perspective. It supposes
that documents are characterised by the statistical distribution of their terms.
When a retrieval is started, a vector model-based system tries to identify those
documents which suit the query best with regard to the statistics of the terms
concerned. To do so, formulae have been developed which help to calculate
the similarity between various documents.
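As a rough illustration of such a ranking, the sketch below scores each document against a query. The text does not say which formula is meant; the cosine of the angle between raw term-count vectors is used here as an assumption, being one common choice. The sample documents and the function names `term_vector`, `cosine_similarity` and `rank` are likewise invented for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, doc_id) pairs, best match first."""
    q = term_vector(query)
    scored = [
        (cosine_similarity(q, term_vector(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    return sorted(scored, reverse=True)


# Made-up sample collection.
documents = {
    "d1": "archive film heritage film",
    "d2": "archive metadata catalogue",
    "d3": "weather forecast rain",
}
ranking = rank("film archive", documents)
```

Unlike a Boolean conjunction, this ranking still returns `d2`, which contains only one of the two query terms, but places it below `d1`.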

The calculation of distances between text documents is based on high-dimensional
vectors of features. The features identified in all documents of a
given collection are extracted. The totality of these features forms a feature
space. On the basis of predefined selection criteria, the number of these features
is reduced. The elimination of so-called stop words is one of a number of
applied techniques. Another identifies, by means of statistical analysis,
terms that occur particularly frequently or particularly rarely. At the end of
this phase, which is generally called feature reduction, the feature space has
n dimensions.
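The reduction phase described above might be sketched as follows. The stop-word list, the thresholds `min_df` and `max_df_ratio`, the choice of document frequency as the statistic and the sample texts are all assumptions chosen for illustration; the text does not prescribe specific criteria.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are far longer.
STOP_WORDS = {"the", "of", "a", "in", "and", "is"}


def reduce_features(documents, min_df=2, max_df_ratio=0.9):
    """Return the sorted list of terms kept after feature reduction.

    A term is dropped if it is a stop word, occurs in fewer than `min_df`
    documents, or occurs in more than `max_df_ratio` of all documents.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(text.lower().split()))
    n_docs = len(documents)
    return sorted(
        term
        for term, df in doc_freq.items()
        if term not in STOP_WORDS
        and df >= min_df
        and df / n_docs <= max_df_ratio
    )


# Made-up sample collection.
documents = [
    "the archive of film",
    "film archive catalogue",
    "the film catalogue",
    "film restoration in the archive",
]
features = reduce_features(documents)
n = len(features)  # dimensionality of the reduced feature space
```

In this toy collection "film" is dropped because it occurs in every document, "restoration" because it occurs in only one, and the stop words by definition, which leaves a two-dimensional feature space.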

It is now possible to describe each text of the collection by a vector. The
value of an element in each dimension, v_j, is defined by the feature matrix
which corresponds to the document. The concrete value of v_j depends on the
applied methodology. Sometimes it is sufficient to signal the presence or absence
of a feature by the figures 0 and 1. In other cases, the absolute or normalised
frequency is preferred. Normalisation is generally necessary to compensate
for the varying length of documents. The distance between two texts can
now be determined on the basis of the corresponding vectors. Simple measures
depend on the distance between the points defined by the vectors in an
n-dimensional space or on the angle which is formed by the vectors. In general,
the values are normalised to results which are placed between 0 and 1, for

