ABSTRACT:
Web crawling is what made large-scale search engines possible. A web crawler behaves much like a spider, which is why it is often called one: it builds a net of links across the web and traps information within that net. This paper investigates how information on the web can be crawled, the basic architecture of a focused web crawler, and how the collected data is stored and used to make decisions that curb cybersecurity issues. The paper adopts a review approach, studying the features that popular search engines use to gather information, and follows software engineering principles to develop an algorithm and a pseudocode for a web crawler. Conventional crawling relies on two traversal techniques for dealing with data: Breadth-First Search (BFS) and Depth-First Search (DFS). Finally, the developed pseudocode demonstrates how crawling can be viewed as a brute-force attack and how such behavior can be avoided. Cybersecurity is a major concern in Industry 4.0, and it could be addressed by implementing a focused crawler that fetches data from the web, indexes it, and builds a knowledge domain for future reference.
Keywords— Web-crawler, pseudocode, BFS, DFS, Search Engine, Cybersecurity
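As a concrete illustration of the BFS and DFS traversal mentioned above, the sketch below (not the paper's pseudocode; the strategy flag, page limit, and delay value are illustrative assumptions) shows how a single crawl frontier switches between BFS and DFS simply by treating it as a queue or a stack, and how a politeness delay keeps the crawl from resembling a brute-force attack on the target host.

```python
# Minimal crawl-frontier sketch: BFS vs. DFS ordering plus a politeness delay.
# This is an assumption-laden illustration, not the crawler developed in the paper.
import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')  # crude link extraction, for the sketch only

def crawl(seed, strategy="bfs", max_pages=20, delay=1.0):
    frontier = deque([seed])   # BFS treats this as a FIFO queue; DFS as a LIFO stack
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue           # skip unreachable or non-HTML pages
        for link in LINK_RE.findall(html):
            frontier.append(urljoin(url, link))
        time.sleep(delay)      # politeness delay: avoid hammering the host like a brute-force attack
    return visited

# Example (hypothetical seed URL): crawl("https://example.com", strategy="dfs")
```

The only difference between the two strategies is which end of the frontier is popped; everything else (fetching, link extraction, visited-set bookkeeping, and rate limiting) stays the same.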