Saturday, August 8, 2009

implimenting

Abstract

In large distributed hypertext system like the World-Wide Web, users find resources by following hypertext links. As the size of the system increases the users must traverse increasingly more links to find what they are looking for, until precise navigation becomes impractical. The WebCrawler is a tool that solves these problems by indexing and automatically navigating the Web. It creates an index by an incomplete breadth-first traversal, then relies on an automatic navigation mechanism to fill in the holes. The WebCrawler indexes both document titles and document content using a vector space model. Users can issue queries directly to the pre-computed index or to a search program that explores new documents in real time. The database the WebCrawler builds is available through a search page on the Web. Experience with this index suggests that the WebCrawler is collecting enough information from the Web, and that while the current query interface is adequate, more sophisticated retrieval tools are necessary. Real-time searches, although not available to the public, are fast and seem well targeted towards what what the user wants.

Introduction

Imagine trying to find a book in a library without a card catalog. It is not an easy task! Finding specific documents in a distributed hypertext system like the World-Wide Web can be just as difficult. To get from one document to another users follow links that identify the documents. A user who wants to find a specific document in such a system must choose among the links, each time following links that may take her further from her desired goal. In a small, unchanging environment, it might be possible to design documents that did not have these problems. But the World-Wide Web is decentralized, dynamic, and diverse; navigation is difficult, and finding information can be a challenge. This problem is called the resource discovery problem.

The WebCrawler is a tool that solves the resource discovery problem in the specific context of the World-Wide Web. It provides a fast way of finding resources by maintaining an index of the Web that can be queried for documents about a specific topic. Because the Web is constantly changing and indexing is done periodically, the WebCrawler includes a second searching component that automatically navigates the Web on demand. The WebCrawler is best described as a "Web robot." It accesses the Web one document at a time, making local decisions about how best to proceed. In contrast to more centralized approaches to indexing and resource discovery, such as ALIWEB and Harvest [Koster, Bowman], the WebCrawler operates using only the infrastructure that makes the Web work in the first place: the ability of clients to retrieve documents from servers.

Although the WebCrawler is a single piece of software, it has two different functions: building indexes of the Web and automatically navigating on demand. For Internet users, these two solutions correspond to different ways of finding information. Users can access the centrally maintained WebCrawler Index using a Web browser like Mosaic, or they can run the WebCrawler client itself, automatically searching the Web on their own.

As an experiment in Internet resource discovery, the WebCrawler Index is available on the World-Wide Web. The index covers nearly 50,000 documents distributed on over 9000 different servers, answers over 6000 queries per day and is updated weekly. The success of this service shows that non-navigational resource discovery is important and validates some of the design choices made in the WebCrawler. Three of these choices stand out: content-based indexing to provide a high-quality index, breadth-first searching to create a broad index, and robot-like behavior to include as many servers as possible.

The first part of this paper describes the architecture of the WebCrawler itself, and some of the tradeoffs made in its design. The second part of the paper discusses some of the experience I have gained in operating the WebCrawler Index.

The Design of the WebCrawler

The most important feature of the World-Wide Web today is its decentralized nature. Anyone can add servers, documents, and hypertext links. For search tools like the WebCrawler, such a dynamic organization means that discovering new documents is an important piece of the design. For the WebCrawler itself, "discovering documents" means learning their identities; on the Web, those identities currently take the form of Uniform Resource Locators (URLs). The URLs can refer to nearly any kind of retrievable network resource; once the WebCrawler has a URL it can decide whether the document is worth searching and retrieve it if appropriate.

After retrieving a document, the WebCrawler performs three actions: it marks the document as having been retrieved, deciphers any outbound links (href's), and indexes the content of the document. All of these steps involve storing information in a database.

Figure 1: Software components of the WebCrawler

The WebCrawler has a software component for each of the activities described above; the components fit together as illustrated in Figure 1. In this architecture, the search engine directs the WebCrawler's activities. It is responsible for deciding which new documents to explore and for initiating their retrieval. The database handles the persistent storage of the document metadata, the links between documents, and the full-text index. The "agents" are responsible for retrieving the documents from the network at the direction of the search engine. Finally, the Query Server implements the query service provided to the Internet. Each of these components is described in detail below.

The Search Engine

The WebCrawler discovers new documents by starting with a known set of documents, examining the outbound links from them, following one of the links that leads to a new document, and then repeating the whole process. Another way to think of it is that the Web is a large directed graph and that the WebCrawler is simply exploring the graph using a graph traversal algorithm.

Figure 2 shows an example of the Web as a graph. Imagine that the WebCrawler has already visited document A on Server 1 and document E on Server 3 and is now deciding which new documents to visit. Document A has links to document B, C and E, while document E has links to documents D and F. The WebCrawler will select one of documents B, C or D to visit next, based on the details of the search it is executing.

The search engine is responsible not only for determining which documents to visit, but which types of documents to visit. Files that the WebCrawler cannot index, such as pictures, sounds, PostScript or binary data are not retrieved. If they are erroneously retrieved, they are ignored during the indexing step. This file type discrimination is applied during both kinds of searching. The only difference between running the WebCrawler in indexing mode and running it in real-time search mode is the discovery strategy employed by the search engine.

Figure 2: The Web as a graph.

Indexing mode: In indexing mode, the goal is to build an index of as much of the Web as possible. If the WebCrawler Index has enough space for 50,000 documents, which documents should those be? For a Web index, one solution is that those documents should come from as many different servers as possible. The WebCrawler takes the following approach: it uses a modified breadth-first algorithm to ensure that every server has at least one document represented in the index. The strategy is effective. The most frequent feedback about the WebCrawler is that it has great coverage and that nearly every server is represented!

In detail, a WebCrawler indexing run proceeds as follows: every time a document on a new server is found, that server is placed on a list of servers to be visited right away. Before any other documents are visited, a document on each of the new servers is retrieved and indexed. When all known servers have been visited, indexing proceeds sequentially through a list of all servers until a new one is found, at which point the process repeats. While indexing, the WebCrawler runs either for a certain amount of time, or until it has retrieved some number of documents. In the current implementation, the WebCrawler builds an index at the rate of about 1000 documents an hour on a 486-based PC running NEXTSTEP.

Real-time search mode: In real-time search mode, where the goal is to find documents that are most similar to a user's query, the WebCrawler uses a different search algorithm. The intuition behind the algorithm is that following links from documents that are similar to what the user wants is more likely to lead to relevant documents than following any link from any document. This intuition roughly captures the way people navigate the Web: they find a document about a topic related to what they are looking for and follow links from there.

To find an initial list of similar documents, the WebCrawler runs the user's query against its index. A single document can also be used as a starting point, but using the index is much more efficient. From the list, the most relevant documents are noted, and any unexplored links from those documents are followed. As new documents are retrieved, they are added to the index, and the query is re-run. The results of the query are sorted by relevance, and new documents near the top of the list become candidates for further exploration. The process is iterated either until the WebCrawler has found enough similar documents to satisfy the user or until a time limit is reached.

One problem with this approach is that the WebCrawler blindly follows links from documents, possibly leading it down an irrelevant path. For instance, if the WebCrawler is searching for pages about molecular biology it may come across a page that has links to both molecular biology resources and unrelated network resources. When people navigate, they choose links based on the anchor text, the words that describe a link to another document, and tend to follow a directed path to their destination. Ideally, the WebCrawler should choose among several of these links, preferring the one that made the most sense. Although the WebCrawler's reasoning ability is somewhat less than that of a human, it does have a basis for evaluating each link: the similarity of the anchor text to the user's query. To evaluate this similarity, the WebCrawler makes a small full-text index from the anchor text in a document and applies the users query to select the most relevant link. Searching the anchor text in this way works well, but anchor texts are usually short and full-text indexing does not work as well as it could. More sophisticated full-text tools would help greatly, particularly the application of a thesaurus to expand the anchor text.

The idea of following documents from other, similar ones was first demonstrated to work in the Fish search [DeBra]. The WebCrawler extends that concept to initiate the search using the index, and to follow links in an intelligent order.

Agents

To actually retrieve documents from the Web, the search engine invokes "agents." The interface to an agent is simple: "retrieve this URL." The response from the agent to the search engine is either an object containing the document content or an explanation of why the document could not be retrieved. The agent uses the CERN WWW library (libWWW), which gives it the ability to access several types of content with several different protocols, including HTTP, FTP and Gopher [Berners-Lee].

Since waiting for servers and the network is the bottleneck in searching, agents run in separate processes and the WebCrawler employs up to 15 in parallel. The search engine decides on a new URL, finds a free agent, and asks the agent to retrieve that URL. When the agent responds, it is given new work to do. As a practical matter, running agents in separate processes helps isolate the main WebCrawler process from memory leaks and errors in the agent and in libWWW.

The Database

The WebCrawler's database is comprised of two separate pieces: a full-text index and a representation of the Web as a graph. The database is stored on disk, and is updated as documents are added. To protect the database from system crashes, updates are made under the scope of transactions that are committed every few hundred documents.

The full-text index is currently based on NEXTSTEP's IndexingKit [NeXT]. The index is inverted to make queries fast: looking up a word produces a list of pointers to documents that contain that word. More complex queries are handled by combining the document lists for several words with conventional set operations. The index uses a vector-space model for handling queries [Salton].

To prepare a document for indexing, a lexical analyzer breaks it down into a stream of words that includes tokens from both the title and the body of the document. The words are run through a "stop list" to prevent common words from being indexed, and they are weighted by their frequency in the document divided by their frequency in a reference domain. Words that appear frequently in the document, and infrequently in the reference domain are weighted most highly, while words that appear infrequently in either are given lower weights. This type of weighting is commonly called peculiarity weighting.

The other part of the database stores data about documents, links, and servers. Entire URLs are not stored; instead, they are broken down into objects that describe the server and the document. A link in a document is simply a pointer to another document. Each object is stored in a separate btree on disk; documents in one, servers in another, and links in the last. Separating the data in this way allows the WebCrawler to scan the list of servers quickly to select unexplored servers or the least recently accessed server.

The Query Server

The Query Server implements the WebCrawler Index, a search service available via a document on the Web [Pinkerton]. It covers nearly 50,000 documents distributed across 9000 different servers, and answers over 6000 queries each day. The query model it presents is a simple vector-space query model on the full-text database described above. Users enter keywords as their query, and the titles and URLs of documents containing some or all of those words are retrieved from the index and presented to the user as an ordered list sorted by relevance. In this model, relevance is the sum, over all words in the query, of the product of the word's weight in the document and its weight in the query, all divided by the number of words in the query. Although simple, this interface is powerful and can find related documents with ease. Some sample query output is shown in below in Figure 3.

WebCrawler Search Results

Search results for the query "molecular biology human genome project":

1000 Guide to Molecular Biology Databases
0754 Guide to Molecular Biology Databases
0562 Biology Links
0469 bionet newsgroups
0412 Motif BioInformatics Server
0405 LBL Catalog of Research Abstracts (1993)
0329 Molecular Biology Databases
0324 GenomeNet WWW server
0317 DOE White Paper on Bio-Informatics
0292 Molecular biology
0277 Oxford University Molecular Biology Data Centre Home Page
0262 Biology Internet Connections
0244 Harvard Biological Laboratories - Genome Research
0223 About the GenomeNet
0207 Biology Index

Figure 3: Sample WebCrawler results. For brevity, only the first 15 of 50 results are shown. The numbers down the left side of the results indicate the relevance of the document to the query, normalized to 1000 for the most relevant document. At the top of the list, two documents have the same title. These are actually two different documents, with different URLs and different weights.

The time required for the Query Server to respond to a query averages about an eighth of a second. That interval includes the time it takes to parse the query, call the database server, perform the query, receive the answer, format the results in HTML and return them to the HTTP server. Of that time, about half is spent actually performing the query.

Observations and Lessons

The WebCrawler has been active on the Web for the last six months. During that time, it has indexed thousands of documents, and answered nearly a quarter of a million queries issued by over 23,000 different users. This experience has put me in touch with many users and taught me valuable lessons. Some of the more important ones are summarized here. My conclusions are based on direct interaction with users, and on the results of an on-line survey of WebCrawler users that has been answered by about 500 people.

Full-text indexing is necessary

The first lesson I learned is that content-based indexing is necessary, and works very well. At the time the WebCrawler was unleashed on the Web, only the RBSE Spider indexed documents by content [Eichmann]. Other indexes like the Jumpstation and WWWW did not index the content of documents, relying instead on the document title inside the document, and anchor text outside the document [Fletcher, McBryan].

Indexing by titles is a problem: titles are an optional part of an HTML document, and 20% of the documents that the WebCrawler visits do not have them. Even if that figure is an overestimate of the missing titles in the network as a whole, basing an index only on titles would omit a significant fraction of documents from the index. Furthermore, titles don't always reflect the content of a document.

By indexing titles and content, the WebCrawler captures more of what people want to know. Most people who use the WebCrawler find what they are looking for, although it takes them several tries. The most common request for new features in the WebCrawler is for more advanced query functionality based on the content index.

Precision is the limiting factor in finding documents

Information retrieval systems are usually measured in two ways: precision and recall [Salton]. Precision measures how well the retrieved documents match the query, while recall indicates what fraction of the relevant documents are retrieved by the query. For example, if an index contained ten documents, five of which were about dolphins, then a query for `dolphin-safe tuna' that retrieved four documents about dolphins and two others would have a precision of 0.66 and a recall of 0.80.

Recall is adequate in the WebCrawler, and in the other indexing systems available on the Internet today: finding enough relevant documents is not the problem. Instead, precision suffers because these systems give many false positives. Documents returned in response to a keyword search need only contain the requested keywords and may not be what the user is looking for. Weighting the documents returned by a query helps, but does not completely eliminate irrelevant documents.

Another factor limiting the precision of queries is that users do not submit well-focused queries. In general, queries get more precise as more words are added to them. Unfortunately, the average number of words in a query submitted to the WebCrawler is 1.5, barely enough to narrow in on a precise set of documents. I am currently investigating new ways to refine general searches and to give users the ability to issue more precise queries.

Breadth-first searching at the server level builds a useful index

The broad index built by the breadth-first search ensures that every server with useful content has several pages represented in the index. This coverage is important to users, as they can usually naviagte within a server more easily than navigating across servers. If a search tool identifies a server as having some relevant information, users will probably be able to find what they are looking for.

Using this strategy has another beneficial effect: indexing documents from one server at a time spreads the indexing load among servers. In a run over the entire Web, each server might see an access every few hours, at worst. This load is hardly excessive, and for most servers is lost in the background of everyday use. When the WebCrawler is run in more restricted domains (for instance, here at the University of Washington), each server will see more frequent accesses, but the load is still less than that of other search strategies.

Robot-like behavior is necessary now

The main advantage of the WebCrawler is that it is able to operate on the the Web today. It uses only the protocols that make the Web operate and does not require any more organizational infrastructure to be adopted by service providers.

Still, Web robots are often criticized for being inefficient and for wasting valuable Internet resources because of their uninformed, blind indexing. Systems like ALIWEB and Harvest are better in two ways. First, they use less network bandwidth than robots by locally indexing and transferring summaries only once per server, instead of once per document. Second, they allow server administrators to determine which documents are worth indexing, so that the indexing tools need not bother with documents that are not useful.

Although Web robots are many times less efficient than they could be, the indexes they build save users from following long chains before they find a relevant document, thus saving the Internet bandwidth. For example, if the WebCrawler indexes 40,000 documents and gets 5000 queries a day, and if each query means the user will retrieve just three fewer documents than she otherwise would have, then it will take about eight days for the WebCrawler's investment in bandwidth to be paid back.

Another advantage of the document-by-document behavior of Web robots is that they can apply any indexing algorithm because they have the exact content of the document to work from. Systems that operate on summaries of the actual content are dependent on the quality of those summaries, and to a lesser extent, on the quality of their own indexing algorithm. One of the most frequent requests for new WebCrawler features is a more complex query engine. This requires a more complex indexing engine, for which original document content may be necessary. Putting such indexing software on every server in the world may not be feasible.

To help robots avoid indexing useless or unintelligible documents, Martin Koster has proposed a standard that helps guide robots in their choice of documents [Koster2]. Server administrators can advise robots by specifying which documents are worth indexing in a special "robots.txt" document on their server. This kind of advice is valuable to Web robots, and increases the quality of their indexes. In fact, some administrators have gone so far as to create special overview pages for Web robots to retrieve and include.

Widespread client-based searching is not yet practical

The future of client-initiated searching, like running the WebCrawler in on-demand mode, is still in doubt because of the load it places on the network. It is one thing for a few robots to be building indexes and running experiments; it is quite another for tens of thousands of users to be unleashing multiple agents under robotic control. Fortunately, most robot writers seem to know this and have declined to release their software publicly.

The key difference between running a private robot and running an index-building one is that the results of the private robot are only presented to one person. They are not stored and made available for others, so the cost of building them is not amortized across many queries.

Future Work

In the short term, the WebCrawler will evolve in three ways. First, I am exploring better ways to do client-based searching. This work centers on more efficient ways to determine which documents are worth retrieving and strategies that allow clients to perform searches themselves and send the results of their explorations back to the main WebCrawler index. The rationale behind the second idea is that if the results of these explorations are shared, then searching by clients may be more feasible.

Second, I am exploring possibilities for including a more powerful full-text engine in the WebCrawler. Several new features are desirable; the two most important are proximity searching and query expansion and refinement using a thesaurus. Proximity searching allows several words to be considered as a phrase, instead of as independent keywords. Using a thesaurus is one way to help target short, broad queries: a query of biology may find many documents containing the word biology, but when expanded by a thesaurus to include more terms about biology, the query will more precisely recognize documents about biology. Finally, I plan on working with the Harvest group to see if the WebCrawler can fit into their system as a broker [Bowman].

Conclusions

The WebCrawler is a useful Web searching tool. It does not place an undue burden on individual servers while building its index. In fact, the frequent use of its index by thousands of people saves the Internet bandwidth. The real-time search component of the WebCrawler is effective at finding relevant documents quickly, but it has the aforementioned problem of not scaling well given the limited bandwidth of the Internet.

In building the WebCrawler, I have learned several lessons. First, full-text indexing is worth the extra time and space it takes to create and maintain the index. In a Web robot, there is no additional load network load imposed by full-text index; the load occurs only at the server. Second, when large numbers of diverse documents are present together in an index, achieving good query precision is difficult. Third, using a breadth-first search strategy is a good way to build a Web-wide index.

The most important lesson I have learned from the WebCrawler and other robots like it is that they are completely consistent with the character of the decentralized Web. They require no centralized decision making and no participation from individual site administrators, other than their compliance with the protocols that make the Web operate in the first place. This separation of mechanism and policy allows each robot to pick its policy and offer its own unique features to users. With several choices available, users are more likely to find a search tool that matches their immediate needs, and the technology will ultimately mature more quickly.

Friday, August 7, 2009

implimentation

Implementing an Effective Web Crawler
Introduction

Web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical and automated manner.

This process is called Web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML codes. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.
Why do we need a web crawler?

Following are some reasons to use a web crawler:

* To maintain mirror sites for popular Web sites.
* To test web pages and links for valid syntax and structure.
* To monitor sites to see when their structure or contents change.
* To search for copyright infringements.
* To build a special-purpose index.for example, one that has some understanding of the content stored in multimedia files on the Web.

How does a web crawler work?

A typical web crawler starts by parsing a specified web page: noting any hypertext links on that page that point to other web pages. The Crawler then parses those pages for new links, and so on, recursively. A crawler is a software or script or automated program which resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is to automate the process of following links.

This is the basic concept behind implementing web crawler, but implementing this concept is not merely a bunch of programming. The next section describes the difficulties involved in implementing an efficient web crawler.
Difficulties in implementing efficient web crawler

There are two important characteristics of the Web that generate a scenario in which Web crawling is very difficult:

1. Large volume of Web pages.
2. Rate of change on web pages.

A large volume of web page implies that web crawler can only download a fraction of the web pages and hence it is very essential that web crawler should be intelligent enough to prioritize download.

Another problem with today.s dynamic world is that web pages on the internet change very frequently, as a result, by the time the crawler is downloading the last page from a site, the page may change or a new page has been placed/updated to the site.
Solutions - Right strategies

The difficulties in implementing efficient web crawler clearly state that bandwidth for conducting crawls is neither infinite nor free. So, it is becoming essential to crawl the web in not only a scalable, but efficient way, if some reasonable amount of quality or freshness of web pages is to be maintained. This ensues that a crawler must carefully choose at each step which pages to visit next.

Thus the implementer of a web crawler must define its behavior.

Defining the behavior of a Web crawler is the outcome of a combination of below mentioned strategies:

* Selecting the better algorithm to decide which page to download.
* Strategizing how to re-visit pages to check for updates.
* Strategizing how to avoid overloading websites.

Selecting the right algorithm

Given the current size of the web, it is essential that the crawler program should crawl on a fraction of the web. Even large search engines in today.s dynamic world crawls fraction of web pages from web. But, a crawler should observe that the fraction of pages crawled must be most relevant pages, and not just random pages.

While selecting the search algorithm for the web crawler an implementer should keep in mind that algorithm must make sure that web pages are chosen depending upon their importance. The importance of a web page lies in its popularity in terms of links or visits, or even its URL.
Algorithm types

Path-ascending crawling

We intend the crawler to download as many resources as possible from a particular Web site. That way a crawler would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of http://foo.org/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.

The advantage with Path-ascending crawler is that they are very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling.

Focused crawling

The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. In this strategy we can intend web crawler to download pages that are similar to each other, thus it will be called focused crawler or topical crawler.

The main problem in focused crawling is that in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; to resolve this problem proposed solution would be to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of a focused crawling depends mostly on the richness of links in the specific topic being searched, and a focused crawling usually relies on a general Web search engine for providing starting points.

How to Re-visit web pages

The optimum method to re-visit the web and maintain average freshness high of web page is to ignore the pages that change too often.

The approaches could be:

* Re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
* Re-visiting more often the pages that change more frequently.

(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.)

The re-visiting methods considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), something that is not a realistic scenario.
How to avoid overloading websites

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers.

The use of Web crawler is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:

* Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
* Server overload, especially if the frequency of accesses to a given server is too high.
* Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
* Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

To resolve this problem we can use robots exclusion protocol, also known as the robots.txt protocol.

The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. We can specify the top level directory of web site in a file called robots.txt and this will prevent the access of that directory to crawler.

This protocol uses simple substring comparisons to match the patterns defined in robots.txt file. So, while using this robots.txt file we need to make sure that we use final ./. character appended to directory path. Else, files with names starting with that substring will be matched rather than directory.

Example of robots.txt files that tells all crawlers not to enter into four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

Web crawler architectures

A crawler must have a good crawling strategy, as noted in the previous sections, but it also needs a highly optimized architecture.
Pseudo code for a web crawler

Here's a pseudo code summary of the algorithm that can be used to implement a web crawler:

Ask user to specify the starting URL on web and file type that crawler should crawl.

Add the URL to the empty list of URLs to search.

While not empty ( the list of URLs to search )
{

Take the first URL in from the list of URLs
Mark this URL as already searched URL.

If the URL protocol is not HTTP then
break;
go back to while

If robots.txt file exist on site then
If file includes .Disallow. statement then
break;
go back to while

Open the URL

If the opened URL is not HTML file then
Break;
Go back to while

Iterate the HTML file

While the html text contains another link {

If robots.txt file exist on URL/site then
If file includes .Disallow. statement then
break;
go back to while

If the opened URL is HTML file then
If the URL isn't marked as searched then
Mark this URL as already searched URL.

Else if type of file is user requested
Add to list of files found.

}
}

webcrawler

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

Screen shot of web page for adding a URL to Google.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see — something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.

To keep the index current, Google continuously recrawls popular frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily, pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
2. Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.

To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s performance.
3. Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.

Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for an interpretation of the concepts and the practical applications contained in Google’s patent application.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search Form and Using Search Operators (Advanced Operators).

About Crawler

A Web Crawler – sometimes referred to as a spider or robot – is a process that visits a number of web pages programmatically, usually to extract some sort of information. For example, the popular search engine Google has a robot called googlebot that sooner or later visits virtually every page on the Internet for the purpose of indexing the words on that page. We are going to develop a general-purpose class that can be used as a basis for writing any type of robot. This class will be simple yet powerful. The heart of the class is a method called CrawlURL, which accepts a beginning URL. The contents of this URL will be loaded into the HTML Container class. For each link found on this page, it recurs back again, thus repeating the process for each of those pages and so on.

The basic process is pretty simple, but we must add a few more features in order to avoid spiralling into an infinite loop. First of all, we want to keep track of pages that we’ve already seen. Many sites are such that Page A points to Page B which points back again to Page A – and the routine will soon be chasing its own tail if we don’t prevent it from doing so. A related problem has to do with the allowable recursion level. If left unbound, many sites will cause the crawler to dig itself into a hopelessly deep hole, using large amounts of stack space and memory. It is also possible to encounter a submission form that will point back to itself, using a parameterized URL that fools the check for already encountered links. For the sake of usability, we want to restrict the range so we can target a specific site or group of sites, while ignoring everything else. Finally, as we will see later, Robot etiquette requires that we maintain an Exclude list so we might as well expose this function to the object caller as well.

The first problem could be solved in a number of ways. We could have an array or collection representing the known URLs and a routine that would check to see if a given URL was already in the list. Actually we can save a bit of work by using the .Net Queue class. This class has two methods that will be useful for this purpose – the Enqueue method, which adds an item to a queue, and the Contains method, that returns True if a given item is in the queue or False if not – exactly the functions needed for the task at hand.

A property will be added to set or indicate the maximum recursion level. We will write a private version of CrawlURL, which accepts a URL as does the public method, but also expects a recursion level parameter. The public interface will simply invoke the private method, passing a zero and the process will continue from there. When the maximum level has been reached, we will stop the process. When this happens, we don’t want to just throw away the links. We may drop pages that are not accessible from another path if we do. We will use another queue that is populated with links that would have been visited had we not exceeded the maximum level. These links will in turn be visited whenever the initial recursion has completed and the process will start again. A lower recursion level will save memory, but not allow any links to be missed completely.

The Include list will be optional. If none is specified, any URL will be allowed unless found on the Exclude list. If one or more are present, the URL in question will be required to match the Include. A partial path will be allowed, so the path:

http://www.bbc.co.uk/sports
…will include or exclude (depending on the list) anything that matches to that point.

A well-mannered robot

As with just about everything in life, there are certain rules that should be followed when writing a robot. There are two important bits of information that every robot should check before visiting a specific link. The first of these is the robots.txt file. This file may or may not be present – but if it is, it will be found in the root directory of the server. For example, the BBC’s robot.txt file can be found at:
http://www.bbc.co.uk/robots.txt
A robots.txt file consists of two parts – the robot name and the list of excludes meant for that specific robot. An asterisk (*) is used to indicate any robot not explicitly named – us for example. Here is a sample entry from the BBC robots.txt file:
User-agent: *

Disallow: /cgi-bin
This tells us that we should not index any files found in the /cgi-bin path, which for this page would mean everything that begins with:
http://www.bbc.co.uk/cgi-bin/
Two important things should be noted about the robots.txt file. First of all, it has nothing to do with security. You can, if you wish, ignore the robots.txt file and traverse to your heart’s content – but it is not considered polite to do so. Secondly, it is often the case that the excluded paths contain duplicated or otherwise uninteresting information that you probably didn’t want to visit anyway. Often the robots.txt file is maintained by the webmaster as a courtesy to the robot writer rather than a hindrance.

While the robots.txt file applies to the entire site, a given page can also have meta-tags that contain robot information. There are two of these that we should be concerned with here – NoIndex and NoFollow, which may appear together, separately or not at all. Here is an example of a meta-tag that contains both values:

	nofollow”>
The NoIndex attribute requests that text not be indexed from that page. The NoFollow attribute requests that none of the links on that page be crawled.

We are just about ready to write the CrawlURL method, but first let’s take a look at the HTML Container class. There are two methods that we will find particularly useful: LoadSource and GetHRefs. The first of these will grab a document given a URL. The second will extract every HRef element from the anchor tags on that page. Perfect – except that it would have been nice if the HTML Document class had included NoIndex and NoFollow properties to save us the trouble of checking them ourselves.

Extending the HTML container class

Well, of course there is an easy solution – inheritance. We will create a new HTML Container class that will inherit the original, while adding two additional properties: NoIndex, True if a NoIndex meta-tag is present, and NoFollow, True if a NoFollow meta-tag is present. Recalling the HTML Document class again, we find the LoadStatus event, which will raise with a Description of either “Complete” or “Error” when the page has completed loading or failed to load. This sounds like a good place to check the document’s meta-tags. We will write a private routine SetRobotsFlags, which will set modular level Boolean variables corresponding to the NoIndex and NoFollow properties depending on the presence or absence of their corresponding meta-tags. When a page has been loaded successfully, this routine will be called to set the flags. In the result of an error, we will set both of these flags to True.

Defining the classes and events

To create a new class based on an existing class we declare the class, and then use the Inherited keyword for VB.NET or suffixing the class name with a colon, followed by the base class for C#. The modified HTML Container will be a stand-alone class – but contained in the same namespace as the WebCrawler class:
public class


PageComplete;

Developing the main engine

We are finally ready for the fun part. We begin by providing the public interface for CrawlURL, which accepts the initial URL supplied by the caller. In the opening section, we discussed the need to keep track of the recursion level, and to save any links on a page that exceeds the limit. These both will be significant here, as we shall soon see. Since we don’t want the user to worry about passing level information, we will create a private interface that accepts a URL and also a Level. The private routine will increment this each time and call itself back – thus keeping track of the current level. The initial call to the CrawlURL function will only complete after every possible path has been followed to the maximum level – which may take some time. Theoretically though, it will eventually complete at which time any links that were skipped because the level was too great will be in the queue that was populated when this condition was detected previously. We need to begin the entire process again for each of these items. To iterate through the items in the queue, we will use the Peek method to get the value of the item, and the Dequeue method to remove the item from the queue. We are putting off most of the work for the private member, so the public version of CrawlURL is fairly simple and straightforward:
public void CrawlURL(string URL)

{
CrawlURL(URL, 0);
while (m_qToDo.Count > 0)
{
URL = (string) m_qToDo.Peek();
m_qToDo.Dequeue();
CrawlURL(URL, 0);
}
}
The private member is a bit more involved, but thanks to all our prep work, it’s really a piece of cake. Since there is no reason to continue without a page, we will begin by invoking the LoadSource method of the HTMLContainer class. Remember that although we redefined this class, the method will be executing in the base class – so we don’t need to worry about how it works – it just works. If no success, we will raise a Page Complete event indicating an error and exit. If the page has loaded correctly, we will raise a Page Complete event, passing back the URL and the HTMLDocument object that was just loaded. This will allow the caller to do whatever processing they wish to do with that page. We will also pass back the Level, in case the caller wants to know how deeply the process has dug itself. Finally, a Boolean variable passed by reference – NoFollow can be set to True by the event handler, which will have the same effect as if a NoFollow meta-tag was found in the document.

Assuming the NoFollow parameter from the NewPage event has not been set to True, we look at the NoFollow property determined by the document’s meta-tags. We will only continue if this is FALSE, which it normally is. The GetHRefs method will return an array of the links on this page. This array is passed to a routine called ShuffleURLS, which will weed the list. Any URLs that are already known, mal-formed, match the Exclude list or don’t match the Include list are rejected. The Include and Exclude lists are set using the methods SetIncludes and SetExcludes, not listed, which simply accept an array of strings, and initialize the private string arrays m_strIncludes and m_strExcludes.

A poor man’s inter-process queuing algorithm

You may be wondering why the weed routine is named ShuffleURLs rather than something like WeedURLs. Well it is so named because once the list has been weeded; it is shuffled into random order! It is probably not apparent why it is necessary to shuffle the list into random order and the answer is that it isn’t. However, if you run multiple versions of a program like this at the same time, you will find that the sessions soon begin playing ‘leapfrog’ with each other. Even if you start them using different URLs, they will soon sync themselves in a seemingly magic fashion. The first program will process a given page only to have the second one visit the same page immediately afterward and so on. Of course the ‘right’ way would be to create some sort of inter-process queue to ensure that each thread had its own slice of the total, but randomly shuffling the list only takes a few additional lines of code and is tremendously effective at reducing this problem. It’s a classic ten-cent solution with a $100 payoff.

The main recursion loop

At this point, the HRefs array is populated with the links on this page. We are ready to repeat the process again for each item in the array. The array will be null if there were no links on this page. Otherwise, we will raise the Queuing event back to the caller, passing the HRefs array. Next we iterate through each item in the array. We will recur back again if the maximum level has not been reached, otherwise we will simply add each item to the ToDo queue for processing later. Here is the complete listing for the private version of CrawlURL:
private void 

}

The DataServices class

The basic engine and much of the infrastructure is now ready, but there is still some unfinished business to complete. We still need to add the code to filter URLs based on the values in the Include/Exclude list as well as the entries in the robots.txt file if present. We can combine all of these functions into a third class, which will be responsible for persisting this sort of information – the DataServices class. Let’s start with the code needed to decipher the robots.txt file. When a new domain name is encountered, we will request the robots.txt file for that server. If it is present, we will add any excluded paths to our Exclude list after normalizing them by prefixing the domain name. Whether the robots.txt file is found or not, we will add this domain name to a queue so when it is encountered next time, we will know it’s already been processed. The routine to do this will be a private routine named LoadRobotExcludes in the DataServices class, which will accept a domain name and load the Excludes table if applicable. The AddDomainName and KnownDomainName functions (not listed here) are trivial routines that use the familiar Queue object to add an item to the queue or check to see if it is there.
private void LoadRobotExcludes(


}

Weeding the list

With the Data Services class, we are ready to look at the weeding process. The private function WeWannaCrawl will accept a domain name and URL, check to see if it is suitable, and return True or False depending on the result. The URL must be well formed, not already encountered, match the Include list and not match the Exclude list. This routine will make use of another trivial function not listed here – WellFormedURL, which simply checks to see if the first 7 characters of the URL are equal to “http://” and returns True if so. IncludedPath and ExlcudedPath, also not listed, simply wrap the IncludedURL and ExcludedURL methods already mentioned:
private bool WeWannaCrawl(


}
Only one major routine remains now – ShuffleURLs, which accepts the domain name and the HRefs array, returning a weeded and randomly shuffled list. This routine consists of two main For loops. The first of these weeds the list, only including items that have passed the WeWannaCrawl function’s careful eye. The second loop uses the random number generator to mix everything up into random order: The simple classes presented here provide a valuable tool for anyone who wishes to write a Web Crawler of any type.

web crawler in asp

Building an ASP.NET website search engine


Once a website grows beyond a couple of dozen pages then it can sometimes be difficult to create a site navigation scheme that allows users to quickly find exactly what they're looking for. One way to improve site navigation is to add a search facility to the website.

Unfortunately, building a website search facility for your website can be a time consuming exercise. Although ASP.NET supports the searching of files using the Windows Indexing Service, writing code to query can Indexing Service can be quite complex. Furthermore, not all web hosting companies support the use of Indexing Service, so this may not be an option for your website.

This example shows how to build a website search engine for ASP.NET. The code samples are in C#, but could be easily adapted for the VB.NET programming language.

Building Your Own Search Engine

While it is possible to build a file based search facility using C#, the problem with this approach is that a significant amount of effort would be required to build the file content indexing routine. A database would also be required to store the list of words within the website. Furthermore, if the file system is indexed rather than the actual website then it would be possible for undesirable content (e.g. include files, global.asax files, restricted access documents) to be indexed and appear in search results.

Building a word index for a website by using a web crawler is an obvious solution to these problems. The web crawler sees the same website content as an end user, so there is no problem with undesired content appearing in search results. Web crawlers can also be prevented from indexing certain parts of websites by making use of robots.txt files and the robots meta tag. Furthermore, a web crawler is not dependent on the underlying technology used on a website, so can crawl websites regardless of whether they use PHP, ASP, ASP.NET or a combination of all three.

Building a web crawler is not a trivial exercise, so this code sample relies on our web crawling product - The Website Utility. This product crawls any website and automatically builds the .NET class necessary to allow the website to be searched for text strings. Note that version 2.0 of the Microsoft .NET Framework or above is required.

The .NET search engine created by The Website Utility is contained within the partial class TWUSearch of the namespace com.WinnershTriangle.TheWebsiteUtility. The partial class is contained in two files: TWUSearchCode.cs and TWUSearchData.cs. Both of these files should be copied to the ASP.NET web application's App_Code folder - the TWUSearch class is then accessible to other code files in the web application.

The TWUSearch partial class has a number of methods and properties, which are described below:

Methods

  • SetQuery(query as string) (returns void) Displays a message that no matching results were found.
  • GetSearchResults() (returns DataSet): Retrieves search results.
  • GetErrorMessage() (returns string): Retrieves a description of the error.

Properties

  • MaximumSearchResults (int): Gets/sets the number of matching documents.
  • ReturnPageTitles (bool): Optionally turn offs the return of page titles in the DataSet.
  • ReturnPageDescriptions (bool): Optionally turn offs the return of page descriptions in the DataSet.
  • HasErrors (bool): Returns true if an error occurred (use the GetErrorMessage() method to retrieve the error message).
  • DebugMessage (string): Returns debugging messages (for troubleshooting only).

The C# partial class file TWUSearchData.cs contains the data structures needed for the search class. If you re-crawl a website to update the search facility, this is the only file that will have changed, so updating the search facility may be achieved by overwriting the website's previous copy of this file.

Using the ASP.NET Search Object from C#

The source code below shows how to instantiate the .NET website search class and retrieve a DataSet of search results matching the search query. In this example, the query is set from the Text property of a textbox called TWUSearch, and the search results are databound to the GridView1 GridView control.

The results are sorted in descending rank by making use of the DataView's Sort method.

///


/// Show the search results after the search button is invoked
///

///
///

protected void submitbutton_Click(object sender, EventArgs e)
{
//Initialise the search class
com.WinnershTriangle.TheWebsiteUtility.TWUSearch SearchObject = new com.WinnershTriangle.TheWebsiteUtility.TWUSearch();

//Set search query from the TextBox control
SearchObject.SetQuery(TWUQuery.Text);

//Initialise a DataSet for the search results
DataSet SearchData = new DataSet();

//Optionally change the maximum number of search results (default is 50)
SearchObject.MaximumSearchResults = 25;

//Optionally turn off the return of page titles (default is to return titles)
SearchObject.ReturnPageTitles = true;

//Optionally turn off the return of page descriptions (default is to return descriptions)
SearchObject.ReturnPageDescriptions = true;

//Retrieve the search results
SearchData = SearchObject.GetSearchResults();

//Note that if the search facility encounters an error you can call
//the GetErrorMessage() method to retrieve a description of the error.

string SearchError = SearchObject.GetErrorMessage();

//Check to see if any matching pages were found
if (SearchObject.NumberOfMatchingPages == 0)
{

//Did an error occur?
if (SearchObject.HasErrors == false)
{

//User probably searched for a term that does not exist
LabelSearchResults.Text = "No matching pages were found for this query. Please try another search.";
GridView1.Visible = false;
}

}

//Did an error occur?
if (SearchObject.HasErrors)
{

LabelSearchResults.Text = "This search failed due to: " + SearchError + ". Please try another search.";
GridView1.Visible = false;
}

//No errors were encountered and there were matching pages in the search
//results, so display the search results GridView

if (SearchObject.HasErrors == false && SearchObject.NumberOfMatchingPages > 0)
{

//Create a DataView from the search results data
DataView SearchDataView = new DataView(SearchData.Tables[0]);

//Sort the search results by rank
SearchDataView.Sort = "PageRank DESC";

GridView1.DataSource = SearchDataView;
GridView1.Visible = true;

//Show the number of search results
LabelSearchResults.Text = SearchObject.NumberOfMatchingPages.ToString() + " matching page(s) were found.";

//Bind the search results data to the GridView
GridView1.DataBind();

}

}

How it Works

The Website Utility extracts all of the words from the website and finds the most relevant pages in the website for each word. Common English words (e.g. got, like, then) are removed, as are words of one or two characters in length. Word rankings depend on many factors, including their distribution through the entire website and their distribution in the content of a specific page.

Pages are sorted in search results according to their ranking for the particular word or words being searched for. The ranking scale goes from 0 to 99. Rank is higher for pages that most closely match the search term. In general, searching for words that are common on the site will produce search results with a lower rank than very specific words that occur on only one or two pages.

Important Note: For very large websites or more sophisticated searching, you may need to consider using a specialised server-based search solution such using ASP.NET to search Microsoft's Indexing Service. The Indexing Service Companion can be used to allow Index Server to search remote websites (and also to search more than one website simultaneously).