Abstract:
As the Internet becomes ever more complex, standalone crawlers find it increasingly difficult to scan the web efficiently for specific information. Continuing our preliminary investigation, this paper proposes a task scheduling strategy for a parallel distributed crawler that combines round-robin and batch scheduling. Jobs are scheduled across several servers using a software system similar to an Enterprise Desktop Grid. A given target set of websites is scanned systematically while external hyperlinks are harvested. We propose an algorithm that dispatches the longest-running of the available tasks as early as possible, with the aim of minimizing the overall job execution time. The well-known round-robin and batch scheduling methods are compared with the proposed "mixed" mode, and the advantages of the mixed mode over both are discussed.
Keywords: enterprise desktop grid, BOINC, web crawling, grid computing, round-robin, batch job.
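The longest-task-first idea mentioned in the abstract resembles the classical longest-processing-time (LPT) heuristic for minimizing makespan. The sketch below is a minimal Python illustration of that general idea only, not the paper's actual algorithm; the function name, task duration estimates, and server count are assumptions introduced for the example.

```python
import heapq

def schedule_longest_first(task_durations, num_servers):
    """Illustrative longest-task-first assignment (a sketch, not the paper's method).

    task_durations: hypothetical estimated run times of crawl tasks (seconds).
    num_servers:    number of crawler servers assumed to be available.
    Returns a per-server list of assigned task ids and the resulting makespan.
    """
    # Dispatch the longest remaining task first so it finishes as early as possible.
    tasks = sorted(enumerate(task_durations), key=lambda t: t[1], reverse=True)

    # Min-heap of (time when the server becomes free, server index).
    servers = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(servers)
    assignment = [[] for _ in range(num_servers)]

    for task_id, duration in tasks:
        free_at, s = heapq.heappop(servers)          # earliest available server
        assignment[s].append(task_id)
        heapq.heappush(servers, (free_at + duration, s))

    makespan = max(t for t, _ in servers)
    return assignment, makespan

if __name__ == "__main__":
    # Hypothetical per-site crawl time estimates (seconds).
    durations = [120, 45, 300, 60, 180, 90]
    plan, makespan = schedule_longest_first(durations, num_servers=3)
    print(plan, makespan)
```

In this sketch, assigning the longest task first keeps it from being started late and dominating the tail of the schedule, which mirrors the abstract's stated goal of minimizing overall job execution time.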