i4c – a novel distributed web crawler

The World Wide Web has grown from a few thousand of pages in 1993 to billions of pages at present [26]. According to a study released in October 2000, the World Wide Web consists of about 550 billion pages, 95% of which are publicly accessible [21]. However,  Google, the largest search engine, in June 2000 contained 560 million full-text-indexed pages [13] which is only about 0.1% of the publicly accessible web.

The role of distributed crawler systems in the field of data mining becomes more and more important, but their designs are not well-documented in literature and their source codes are not opened to public.

Our project, i4c crawler, is  a flexible, extensible and portable distributed web crawler system written entirely in Java programming language. The main features of i4c are platform independent, easy installation, fault toleration, innovative assigning mechanism and balance locally workload.

Web Crawler

A web crawler, robot or spider is a program or suite of programs that is capable of iteratively and automatically downloading web pages, extracting URLs from their HTML and fetching them [9,20].

Distributed Web Crawler

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. The idea is to spread out the required resources of computation and bandwidth to many computers and networks [30].

  • By flexible, we mean that our program can either run in standalone mode or join a distributed system. The client module is also flexible in configuration which allows user to decide the number of threads to run on each client based on the machine performance. This feature will utilize the machine performance in order to maximize the productivity of the crawling process.
  • By extensible, we mean that our system is following the Close and Open Principle of Object Oriented Design [20], the software is “closed for modification” but “opened for extension”. Each component is written in modular structure so it is convenient for future development when we want extend the current system simply by plugging new modules. For instances, Scheduler is a module that is plugged  at the later phase of the software because we want to run the testing in SIT labs
  • By portable, we mean that our system is environment independent which mean it can operate without the installation any other software. Moreover, if the program is running in distributed mode in a local network, any computer in the network would play the role of Central Server, therefore we can have many Central Servers running at the same time, and the client would choose which server to join.
  • By work balance, we mean that our program will keep the work balance for each client in the system. Our work balance is estimated base on the performance of the machine, not on the number of task assign from central component. When client start, user decide the number of threads to run on this machine, then each thread will responsible for crawling one single host. When a thread finishes its task, it will ping server for a new host and continue crawling, so that if we start a client with five threads, then it will always maintain five threads running at all time. The client is running and utilize its own ability, it is not enforced by server.

About the project

i4c crawler is the output of my last semester project in MIT degree.

- Project title: Building a Large Scale Image Database.
- Supervisor: Dr Zhiyong Wang. School of IT, University of Sydney.
- Email:  z.wang@usyd.edu.au
- Submission date: June 9, 2009.

Even thought the purpose of the project is developing “a distributed crawler system that would browse the internet and search for image files, then download these files, store on the hard disk, to serve for particular academic purposes”, i4c crawler has gone beyond that target.

Some statistic numbers of the project:

  • 10 packages
  • 97 classes
  • 20,816 lines of codes

For detail documentation of analysis, design, implementation, evaluation as well as references please refer to the downloads section at the end of this entry

System Architecture:

The system contains three main components, which are Central Server, Distributed Crawler and Messengers. Below is the Sequence Diagram showing how the components interact with each other in the system.

UML

Sequences diagram:

Sequence diagram

Assignment Strategy

The Host Distributor uses an innovative assignment strategy by applying the theory of ping architecture [10]. In the traditional assignment strategy, distributor  will select the jobs from queue and push to available crawlers in the system, that why it is called push architecture. The disadvantage of this strategy is that central component just assigns jobs to client without the concern about client’s state which would lead to unbalance workload and sometime work overload at clients.

In the ping architecture, server is no longer automatically push jobs to client, the assignment is working another way around: the client actively ping server to asks for job. Below is the algorithm of assignment:

Flow chart

Sample System with 3 distributed crawlers

sample

Crash Recovery Strategy

During the process of crawling, the server holds a list of [Crawler IP, Host] of each crawler joined the system. When a client dies, the server will notice that a client is no longer connecting, then server will send all pairs of [Crawler IP, Host] that the client is working on to the table UnfinishedHost of database while other clients are still working.

crash

Then, when the crashed client is re-active, it will ping the server, server will check in the UnfinishedHost list:

- If  the client IP is in table UnfinishedHost, then that client still has some unfinished jobs.

- Server will get the number of the unfinished jobs  (U)  of that client and compare with the number of threads (T) input from the client at this time:

  • if U > T or U = T then server will get a list of T unfinished hosts and assign to the client.
  • If U < T, server will get all U unfinished hosts and get ( T – U) new hosts, assign to that client so that all threads has a host seed to work.
  • If U < T, and there is no more new hosts, then server will send all U new hosts to client and ask that client to decrease the number of threads to U.

The client then start a crawling process by firstly look into the cache, retrieves the crawled files before, then do the crawling.

reactive

Crawling strategy

There are many crawling strategies that has been invented and used by search engines, such as breadth-first, backlink-count, On-line Page Importance Computation and partial Pagerank, the one used by Google [14].
The i4c client crawler use the breadth-first strategy which has been used and tested in [24,8]. Breadth-first-search is a graph search algorithm that begins at the root node and explores all the neighboring nodes. Then for each of those nearest nodes, it explores their unexplored neighbor nodes, and so on, until it finds the goal [18].

300px-Breadth-first-tree.svg

Downloader is responsible for downloading documents by sending HTTP request and store the downloaded documents on local disk. The downloader is designed to respect the polite policy and access authentication by applying the robot exclusion standard and NTLM.

As indicated by [19], the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:

  • network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
  • server overload, especially if the frequency of accesses to a given server is too high;
  • poorly-written crawlers, which can crash servers or routers, or which download pages they cannot handle
  • personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable [33].

Beside ,the Downloader also implements the delay interval concept as recommended in [5] and [35] which leave a spare period of several seconds between requests.

NTLM (NT LAN Manager) is a Microsoft authentication protocol used with the SMB protocol. The protocol uses a challenge-response sequence requiring the transmission of three messages between the client (wishing to authenticate) and the server (requesting authentication) [34,23]:

  • The client first sends a Type 1 message containing a set of flags of features supported or requested (such as encryption key sizes, request for mutual authentication, etc.) to the server.
  • The server responds with a Type 2 message containing a similar set of flags supported or required by the server (thus enabling an agreement on the authentication parameters between the server and the client) and, more importantly, a random challenge (8 bytes).
  • Finally, the client uses the challenge obtained from the Type 2 message and the user’s credentials to calculate the response. The calculation methods differ based on the NTLM authentication parameters negotiated previously, but in general they apply MD4/MD5 hashing algorithms and DES encryption to compute the response. The client then sends the response to the server in a Type 3 message.

Screen shots:

client_up

Distributed Client

server_up

Central Server

frame_view_up

Downloaded file view

Downloads:
Documentation:i4c_crawler_documentation.rar
Source code:i4c_crawler_source.rar

NDLoc, 26/06/2009

June 23, 2009    Posted in: SEO, Web crawler

Leave a Reply