Abstract:
As the Internet becomes ever more complex, standalone crawlers find it increasingly difficult to scan the web efficiently for specific information. Continuing our preliminary investigation, this paper proposes a task scheduling strategy for a parallel distributed crawler that combines round-robin and batch scheduling. Jobs are scheduled across several servers using a software system similar to an Enterprise Desktop Grid. A given target set of websites is scanned systematically while external hyperlinks are harvested. We propose an algorithm that dispatches the longest-running of the available tasks as early as possible, with the aim of minimizing the overall job execution time. The well-known round-robin and batch scheduling methods are compared with the proposed "mixed" mode, and the advantages of the mixed mode over both are discussed.
Keywords: enterprise desktop grid, BOINC, web crawling, grid computing, round-robin, batch job.
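The longest-task-first idea mentioned in the abstract resembles the classical longest-processing-time (LPT) heuristic for minimizing makespan. The sketch below is a minimal Python illustration of that general idea only, not the paper's actual algorithm; the function name, task duration estimates, and server count are assumptions introduced for the example.

```python
import heapq

def schedule_longest_first(task_durations, num_servers):
    """Illustrative longest-task-first assignment (a sketch, not the paper's method).

    task_durations: hypothetical estimated run times of crawl tasks (seconds).
    num_servers:    number of crawler servers assumed to be available.
    Returns a per-server list of assigned task ids and the resulting makespan.
    """
    # Dispatch the longest remaining task first so it finishes as early as possible.
    tasks = sorted(enumerate(task_durations), key=lambda t: t[1], reverse=True)

    # Min-heap of (time when the server becomes free, server index).
    servers = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(servers)
    assignment = [[] for _ in range(num_servers)]

    for task_id, duration in tasks:
        free_at, s = heapq.heappop(servers)          # earliest available server
        assignment[s].append(task_id)
        heapq.heappush(servers, (free_at + duration, s))

    makespan = max(t for t, _ in servers)
    return assignment, makespan

if __name__ == "__main__":
    # Hypothetical per-site crawl time estimates (seconds).
    durations = [120, 45, 300, 60, 180, 90]
    plan, makespan = schedule_longest_first(durations, num_servers=3)
    print(plan, makespan)
```

In this sketch, assigning the longest task first keeps it from being started late and dominating the tail of the schedule, which mirrors the abstract's stated goal of minimizing overall job execution time.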