03.01.2015 Views

Combining Information from Multiple Internet Sources

Combining Information from Multiple Internet Sources

Combining Information from Multiple Internet Sources

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

UTILITY CLASSES:<br />

AgentController - these objects are used to control the SearchAgents lifespan.<br />

The ManagerAgent controls the life of its SearchAgents through the objects of<br />

this class. Each AgentController object is responsible for controlling one<br />

corresponding SearchAgent. Objects of this class can create, destroy and<br />

suspend the agent. This class is provided by the JADE framework.<br />

SearchEngineExplorer – objects of this class are used to utilize SearchEngine<br />

classes during the search process. Those objects are responsible for<br />

filtering URLs which should not be considered as results when the search<br />

results are received.<br />

LinkExtractor – objects of this class are responsible for retrieving the<br />

search results <strong>from</strong> the web. This is class is actually the class which<br />

queries the search engines and parses the result pages provided by the<br />

engines.<br />

Listing 2.2.2.2 Utility classes used in the Main module<br />

When the MA receives the SearchParams object <strong>from</strong> the web page it is unwrapped into two<br />

separate parameters. A List is a query divided into single Strings where each is a phrase<br />

<strong>from</strong> the query and String which is the selected processing algorithm name. Having the search<br />

parameters unwrapped, MA gets the SearchEngine objects <strong>from</strong> the database. Each SearchEngine<br />

configuration is stored in the database since each search engine utilizes different search parameters,<br />

such as HTTP parameters for controlling its output, that is parameter controlling number of<br />

displayed results and parameter that states the actual query. Also to each of the search engines there<br />

is a list of URLs which should be ignored when processing given web page with results. Those<br />

URLs are resources which are related to the given search engine, but not directly to its search<br />

process. For instance Google search engine has many hyperlinks pointing to Maps search, Image<br />

search and other services. Those URLs should not be considered as a part of the search results and<br />

therefore should be ignored during the URL retrieval process. <strong>Information</strong> about which URLs<br />

should be ignored is stored in the database and is extracted during SearchEngine creation process so<br />

that each SearchEngine has its list of “to be ignored” URLs assigned. Later on, when setup of the<br />

SearchEngines is finished those are stored in memory of the MA since SearchEngines are needed in<br />

further steps. Afterwards MA creates multiple SAs. The creation process is dynamic: there are to be<br />

as many SA created as there were SearchEngines retrieved <strong>from</strong> the database. During this creation<br />

process MA creates AgentControllers. Each AgentController is used to control on “real” SA and it<br />

immediately starts their lives. Afterwards, when SAs are started they wait for the query and<br />

13

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!