04.11.2014 Views

Electronic version of the publication - FIIT STU - Slovenská technická ...


242 Selected Studies on Software and Information Systems

Paper [64] similarly deals with ambiguity problems in person information mining on the
Web. The authors propose five distinct features and a cascaded multiple-clusterer approach for
name disambiguation, using personal titles, community chains, contextual terms, temporal
expressions and hostnames.

Web Object Extraction. Web object extraction/retrieval is a new approach to information
retrieval on the Web being developed at Microsoft Research. Current search engines
work at the document level, ranking documents by their relevance to a set of keywords
(the query). However, these documents embed various kinds of objects along with their
attributes, such as people, products, papers⁷, organizations, etc. Web object extraction
aims to extract such objects in order to build object-level vertical search engines (specialized
in a particular domain). Such a search engine returns a list of objects with explicit properties
instead of a list of URLs, which the user must spend significant effort deciphering to find the
needed information [49]. Moreover, the authors of the web object extraction method also deal
with integrating instances of the same object retrieved from multiple sources into one
“real-world” object.
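The difference between document-level and object-level results can be illustrated with a minimal sketch. The `WebObject` schema, the `object_search` helper and the toy records below are assumptions made for illustration only; they are not taken from the system described in [49].

```python
from dataclasses import dataclass, field


@dataclass
class WebObject:
    """One extracted object with explicit attributes (hypothetical schema)."""
    kind: str                                        # e.g. "paper", "author"
    attributes: dict = field(default_factory=dict)   # explicit properties
    sources: list = field(default_factory=list)      # URLs the object was seen on


def object_search(warehouse, kind, **wanted):
    """Return objects of one kind whose attributes contain the wanted values."""
    hits = []
    for obj in warehouse:
        if obj.kind != kind:
            continue
        # Case-insensitive substring match on every requested attribute.
        if all(str(value).lower() in str(obj.attributes.get(key, "")).lower()
               for key, value in wanted.items()):
            hits.append(obj)
    return hits


# Toy warehouse with made-up records.
warehouse = [
    WebObject("paper", {"title": "Toy paper on object retrieval"},
              ["http://example.org/a"]),
    WebObject("author", {"name": "Jane Doe"}, ["http://example.org/b"]),
    WebObject("paper", {"title": "Toy paper on name disambiguation"},
              ["http://example.org/c"]),
]

results = object_search(warehouse, "paper", title="object")
for obj in results:
    print(obj.kind, obj.attributes, obj.sources)
```

The result is a list of typed records whose properties can be read directly, whereas a document-level engine would return only the URLs stored in `sources`.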

Figure 8-9 depicts the architecture of a system for extracting scientific papers. It is able to extract
four types of objects (papers, authors, conferences and locations) and the relationships between
them. The architecture follows the method: a web crawler and a classifier automatically
collect all relevant webpages/documents that contain object information for a specific vertical
domain. The crawled webpages/documents are then passed to the corresponding object
extractor, which extracts the structured object information and builds the object warehouse.
The aggregators then merge information about the same object from
multiple different data sources.
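The flow through this architecture (classifier, extractor, aggregator, warehouse) can be sketched as follows. This is only an illustration under strong assumptions: the function names, the keyword-based classifier, the "key: value" extractor and the title-based merging are hypothetical stand-ins, not the components of the described system.

```python
from collections import defaultdict


def classify(page):
    """Toy classifier: keep pages that look relevant to the vertical domain."""
    return "paper" in page["text"].lower()


def extract(page):
    """Toy extractor: read 'key: value' lines into one object record."""
    record = {"source": page["url"]}
    for line in page["text"].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            record[key.strip().lower()] = value.strip()
    return record


def aggregate(records, key="title"):
    """Merge records that share the same normalized key into one object."""
    merged = defaultdict(lambda: {"sources": []})
    for rec in records:
        ident = rec.get(key, "").lower()
        if not ident:
            continue
        obj = merged[ident]
        obj["sources"].append(rec["source"])
        for name, value in rec.items():
            if name != "source":
                obj.setdefault(name, value)  # keep the first value seen
    return dict(merged)


# Made-up crawl result standing in for the crawler output.
crawled = [
    {"url": "http://example.org/1",
     "text": "Paper\nTitle: Toy Study\nAuthor: A. Author"},
    {"url": "http://example.org/2",
     "text": "Paper\ntitle: toy study\nVenue: Toy Conf"},
    {"url": "http://example.org/3",
     "text": "Cooking recipes\nTitle: Soup"},
]

relevant = [page for page in crawled if classify(page)]
records = [extract(page) for page in relevant]
warehouse = aggregate(records)
print(warehouse)
```

Here the two pages describing the same toy paper end up as a single warehouse entry that combines the author from one source with the venue from the other; a real aggregator would need a far more robust notion of object identity than a lowercased title.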

The key point is object extraction itself. The problem is that webpages are generated by
tens of thousands of different templates. One possible solution is to distinguish webpages
generated by different templates and then build an extractor for each template (the so-called
template-dependent solution). This solution is of no use in real-world applications: firstly,
it is practically impossible to collect all possible templates (even webpages from the same
website may be generated by several different templates). Secondly, it would be impossible
to train and maintain all the extractors required for each template.

The authors in [50] analyzed webpages across web sites and identified several
template-independent features:

– Information about an object in a web page is generally grouped together as an object
block, which can be further segmented into object elements, each providing information
about an individual attribute.

– Strong sequence characteristics exist for web objects of the same type across different
web sites. For example, a person’s name always precedes the contact information
(telephone, postal address) in all the pages.
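The two observations above can be made concrete with a small sketch: an object block is represented as an ordered list of labeled elements, and the sequence characteristics are measured by counting label transitions. The blocks, labels and helper names below are made up for illustration and do not come from the analysis in [50].

```python
from collections import Counter

# Hypothetical hand-labeled object blocks: each block is a list of
# (element text, attribute label) pairs in page order.
blocks = [
    [("Jane Doe", "name"), ("555 0100 200", "phone"), ("1 Example St", "address")],
    [("John Roe", "name"), ("2 Sample Rd", "address"), ("555 0100 300", "phone")],
    [("Ann Poe", "name"), ("555 0100 400", "phone")],
]


def transition_counts(blocks):
    """Count how often one attribute label directly follows another."""
    counts = Counter()
    for block in blocks:
        labels = [label for _, label in block]
        for prev, nxt in zip(labels, labels[1:]):
            counts[(prev, nxt)] += 1
    return counts


def precedes(blocks, first, second):
    """Fraction of blocks containing both labels where `first` comes earlier."""
    both = ordered = 0
    for block in blocks:
        labels = [label for _, label in block]
        if first in labels and second in labels:
            both += 1
            if labels.index(first) < labels.index(second):
                ordered += 1
    return ordered / both if both else 0.0


counts = transition_counts(blocks)
print(counts[("name", "phone")])
print(precedes(blocks, "name", "phone"))
print(precedes(blocks, "phone", "address"))
```

In this toy data the name precedes the phone number in every block, while the relative order of phone and address varies; regularities of the first kind are what a sequence model can exploit independently of any page template.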

Based on these template-independent features, the authors propose a template-independent method of
Web Object Extraction based on linear-chain Conditional Random Fields (CRFs) and achieved

⁷ A working object-level search engine for scientific papers can be found at http://libra.msra.cn/.
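To give an intuition for how a linear-chain model labels the elements of an object block, the sketch below performs Viterbi decoding over additive scores. It is a simplification under stated assumptions: the `emission` and `transition` scores are hand-set toy values rather than learned CRF weights, the features (digit counts) are hypothetical, and training is not shown. In an actual CRF each score would be a weighted sum of feature functions with weights estimated from labeled data.

```python
LABELS = ["name", "phone", "address"]


def emission(text, label):
    """Toy score for how well one element fits one label (assumed values)."""
    digits = sum(ch.isdigit() for ch in text)
    if label == "phone":
        return 2.0 if digits >= 6 else -2.0
    if label == "address":
        return 1.0 if 0 < digits < 6 else -1.0
    return 1.0 if digits == 0 else -2.0  # label == "name"


def transition(prev, nxt):
    """Toy score for adjacent labels: names tend to precede contact details."""
    if prev == "name" and nxt in ("phone", "address"):
        return 1.0
    if nxt == "name":
        return -1.0
    return 0.0


def viterbi(observations, labels, emission, transition):
    """Return the highest-scoring label sequence for the observations."""
    # best[t][y]: best total score of a labeling of the first t+1 elements
    # that ends in label y.
    best = [{y: emission(observations[0], y) for y in labels}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition(p, y))
            scores[y] = best[-1][prev] + transition(prev, y) + emission(obs, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


elements = ["Jane Doe", "555 0100 200", "1 Example St"]
decoded = viterbi(elements, LABELS, emission, transition)
print(decoded)
```

The decoder chooses the labeling jointly, so the score of each label depends on its neighbor as well as on the element text, which is how sequence regularities across web sites can be used without any per-template extractor.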
