Wednesday, February 13, 2008

Why Spiders Have Difficulty Crawling a Web Page

Certain types of navigation may hinder or entirely prevent search engine spiders from reaching your website's content. As search engine spiders crawl the web, they rely on the architecture of hyperlinks to find new documents and revisit those that may have changed. In the analogy of speed bumps and walls, complex links and deep site structures with little unique content serve as "bumps," while data that cannot be reached through spiderable links qualifies as "walls."

Possible "Speed Bumps" for SE Spiders:

URLs with 2+ dynamic parameters, e.g. http://www.url.com/page.php?id=4&CK=34rr&User=%Tom% (spiders may be reluctant to crawl complex URLs like this because they often produce errors for non-human visitors)
Pages with more than 100 unique links to other pages on the site (spiders may not follow each one)
Pages buried more than 3 clicks/links from the home page of a website (unless there are many other external links pointing to the site, spiders will often ignore deep pages)
Pages requiring a "Session ID" or Cookie to enable navigation (spiders may not be able to retain these elements as a browser user can)
Pages that are split into "frames" can hinder crawling and cause confusion about which pages to rank in the results.
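The "speed bump" checks above can be expressed as a simple heuristic. This is only an illustrative sketch, not actual spider code: the function name is made up for this example, and the thresholds (2+ parameters, 100 links, 3 clicks) come straight from the list above.

```python
from urllib.parse import urlparse, parse_qs

def crawl_speed_bumps(url, click_depth, on_page_link_count):
    """Flag the 'speed bump' conditions listed above for a given page.

    click_depth and on_page_link_count would come from the crawl itself;
    here they are passed in directly to keep the sketch self-contained.
    """
    warnings = []
    # 2+ dynamic parameters in the URL's query string
    if len(parse_qs(urlparse(url).query)) >= 2:
        warnings.append("2+ dynamic URL parameters")
    # more than 100 unique on-page links
    if on_page_link_count > 100:
        warnings.append("more than 100 on-page links")
    # buried more than 3 clicks from the home page
    if click_depth > 3:
        warnings.append("buried more than 3 clicks from the home page")
    return warnings
```

A page like the example URL above, four clicks deep with 120 links, would trip all three warnings, while a clean top-level page would trip none.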

How a Web Page Appears in Search Engines

When you search for information on the internet, you typically type a query or keyword phrase into a search engine, and the search engine returns results for that query or keyword phrase.

So, how does a web page appear in search engines?
Here is the process:

1. Crawling a Web Page
Search engines run automated programs, called "bots" or "spiders," that use the hyperlink structure of the web to "crawl" the pages and documents that make up the World Wide Web. Estimates are that of the approximately 20 billion existing pages, search engines have crawled between 8 and 10 billion.
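Crawling by following the hyperlink structure can be sketched as a breadth-first traversal. This is a toy illustration: the dictionary below stands in for the web, whereas a real spider would fetch pages over HTTP and parse their HTML for links.

```python
from collections import deque

# A toy "web": each URL maps to the links found on that page.
TOY_WEB = {
    "/home": ["/about", "/blog"],
    "/about": ["/home"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/home"],
}

def crawl(start):
    """Breadth-first crawl: follow hyperlinks outward from a start page,
    visiting each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in TOY_WEB.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Starting from "/home", the crawl discovers every page reachable through links; a page with no inbound links would never be found, which is exactly why spiderable links matter.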

2. Indexing Documents
Once a page has been crawled, its contents are "indexed" and stored in a giant database of documents that makes up a search engine's "index." This index must be tightly managed so that requests which must search and sort billions of documents can be completed in fractions of a second.
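The data structure behind this is commonly an inverted index: a mapping from each term to the documents that contain it, so a query never has to scan every document. A minimal sketch, with made-up document ids:

```python
def build_index(docs):
    """Build an inverted index: term -> set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index
```

Looking up a term is then a single dictionary access rather than a scan of billions of documents, which is what makes sub-second answers possible.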

3. Processing Queries
When a request for information comes into the search engine (hundreds of millions do each day), the engine retrieves from its index all the documents that match the query. A match is determined if the terms or phrase are found on the page in the manner specified by the user.

For example, a search for fish and fishing magazine at Google returns 8.25 million results, but a search for the same phrase in quotes ("fish and fishing magazine") returns only 166 thousand results. In the first case, commonly called "Findall" mode, Google returned all documents containing the terms "fish", "fishing", and "magazine" (it ignores the term "and" because it is not useful for narrowing the results), while in the second search, only those pages with the exact phrase "fish and fishing magazine" were returned. Other advanced operators (Google has a list of 11) can change which results a search engine will consider a match for a given query.
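The difference between the two matching modes can be illustrated with two small functions. This is a simplified sketch of the idea, not how Google actually implements matching; the stopword handling is reduced to ignoring "and" as in the example above.

```python
def findall_match(query_terms, text, stopwords=("and",)):
    """'Findall' mode: every non-stopword term must appear somewhere
    in the document, in any order."""
    words = set(text.lower().split())
    return all(t in words for t in query_terms if t not in stopwords)

def phrase_match(phrase, text):
    """Quoted search: the exact phrase must appear as-is."""
    return phrase.lower() in text.lower()
```

A page containing "a magazine about fishing for fish" matches the Findall query but not the quoted phrase, which is why the quoted search returns far fewer results.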

4. Ranking Results
Once the search engine has determined which results are a match for the query, the engine's algorithm (a set of mathematical rules used for sorting) runs calculations on each of the results to determine which is most relevant to the given query. The results are then sorted on the results pages from most relevant to least, so that users can choose which to select.
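As a toy illustration of this step, the sketch below scores each matching document by how often the query terms appear and sorts most-relevant first. Real ranking algorithms weigh far richer signals (link popularity, page structure, and many more); raw term frequency is only a stand-in here.

```python
def rank(results, query_terms):
    """Sort matching documents most-relevant first, using term
    frequency as a stand-in for a real relevance score."""
    def score(text):
        words = text.lower().split()
        return sum(words.count(t) for t in query_terms)
    return sorted(results, key=score, reverse=True)
```

Given three matching pages, the one mentioning the query terms most often is listed first on the results page.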

Although individual search operations are not particularly lengthy, systems like Google, Yahoo!, AskJeeves, and MSN are among the most complex, processing-intensive computer systems in the world, performing millions of calculations each second and serving demands for information from an enormous number of users.