Combining Information from Multiple Internet Sources
UTILITY CLASSES:
AgentController - objects of this class control the lifespan of the SearchAgents.
The ManagerAgent controls the life of its SearchAgents through the objects of
this class. Each AgentController object is responsible for controlling one
corresponding SearchAgent. Objects of this class can create, destroy and
suspend the agent. This class is provided by the JADE framework.
SearchEngineExplorer - objects of this class use the SearchEngine classes
during the search process. They are responsible for filtering out URLs
which should not be considered as results when the search results are
received.
LinkExtractor - objects of this class are responsible for retrieving the
search results from the web. This is the class which actually queries the
search engines and parses the result pages provided by the engines.
Listing 2.2.2.2 Utility classes used in the Main module<br />
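The filtering role that the listing assigns to SearchEngineExplorer can be illustrated with a minimal Java sketch. The class name `IgnoredUrlFilter`, the method `filter` and the choice of prefix matching are assumptions made for illustration only; the source does not state how a retrieved URL is compared against the ignored ones, and the example URLs are invented.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not taken from the described system): drops URLs that
 * belong to the search engine itself rather than to its search results.
 */
final class IgnoredUrlFilter {

    private IgnoredUrlFilter() {
        // static helper, not meant to be instantiated
    }

    /**
     * Returns the URLs that do not start with any of the ignored prefixes,
     * preserving their original order. Prefix matching is an assumption.
     */
    static List<String> filter(List<String> urls, List<String> ignoredPrefixes) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            boolean ignored = false;
            for (String prefix : ignoredPrefixes) {
                if (url.startsWith(prefix)) {
                    ignored = true;
                    break;
                }
            }
            if (!ignored) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative data only: engine-related links versus real results.
        List<String> ignored = List.of(
                "https://maps.google.com",
                "https://images.google.com");
        List<String> found = List.of(
                "https://example.org/article",
                "https://maps.google.com/maps?q=jade",
                "https://example.com/paper.pdf");
        System.out.println(filter(found, ignored));
    }
}
```

In this sketch the ignore list is passed in as a plain list; in the described system it would be the list assigned to each SearchEngine when it is loaded from the database.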
When the MA receives the SearchParams object from the web page, it unwraps it into two
separate parameters: a List of Strings, each holding one phrase from the query, and a String
holding the name of the selected processing algorithm. Having unwrapped the search
parameters, the MA gets the SearchEngine objects from the database. Each SearchEngine
configuration is stored in the database because each search engine uses different search parameters,
such as the HTTP parameters controlling its output, that is, the parameter controlling the number of
displayed results and the parameter that carries the actual query. Each search engine also has
a list of URLs which should be ignored when processing a given web page with results. Those
URLs are resources which are related to the given search engine, but not directly to its search
process. For instance, the Google search engine has many hyperlinks pointing to Maps search, Image
search and other services. Those URLs should not be considered part of the search results and
therefore should be ignored during the URL retrieval process. Information about which URLs
should be ignored is stored in the database and is extracted during the SearchEngine creation process, so
that each SearchEngine has its list of "to be ignored" URLs assigned. Later on, when the setup of the
SearchEngines is finished, they are kept in the memory of the MA, since the SearchEngines are needed in
further steps. Afterwards the MA creates multiple SAs. The creation process is dynamic: as many
SAs are created as there were SearchEngines retrieved from the database. During this creation
process the MA creates AgentControllers. Each AgentController is used to control one "real" SA, and the MA
immediately starts the SAs' lives. Afterwards, when the SAs are started, they wait for the query and