What is a Spider ?

A 'spider' is a searchbot: the program that a search engine uses to crawl a website and index its pages for the search results. Because it 'crawls the web', it is popularly termed a spider.

In reality, of course, nothing leaves the search engine's premises to journey across the web, enter websites, list their pages, and move on to the next site. A spider is simply a program that runs at the search engine's datacentre and requests pages over the network - but it is convenient to think of something physical going out there and grabbing website data.
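
To make the idea concrete, here is a minimal Python sketch of the core step a spider repeats: read one page's HTML and collect the links it would visit next. It is an illustration only, not how any particular search engine works. The class name `LinkExtractor`, the function `extract_links`, the sample page and the example.com addresses are all made up for this example, and a real spider would download the HTML over HTTP rather than read it from a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content (assumption: not taken from any real site).
SAMPLE_HTML = """
<html><body>
  <a href="/about.html">About</a>
  <a href="products/index.html">Products</a>
  <a href="https://example.org/partner">Partner</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links("https://example.com/", SAMPLE_HTML):
        print(link)
```

Note that the sketch only sees plain HTML links, which is one reason simple HTML navigation is easier for a spider to follow than links hidden inside Flash or scripts.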

Why do we need to bother about spiders ?

Without spiders there would be no websites listed or appearing in the search results. So we need to make their task as easy as possible, and ensure that spiders are both welcome at our site, and have a straightforward job indexing our resources.

Once you realise this, it becomes easier to arrange things so that your website is indexed properly. For example, you will remove all Flash from your website navigation, because spiders handle it badly: most cannot follow Flash links, and those that can do so with difficulty.

What is a good website for spidering ?

Best practice for SEO means that a website must be set up to be easily spidered. A well-prepared site covers all of these points:
  • The site has simple HTML / CSS menus for navigation
  • The pagecode is as clean as possible
  • There is a minimum of JavaScript, and preferably none
  • Flash is only used for the occasional graphics
  • There are at least two formats of sitemap
  • All pages have links to them
  • The most important pages have the most links
  • All pages can be reached with three clicks or fewer
  • There are many backlinks to the site
  • Some backlinks point to the website front page
  • Many backlinks point to inside pages
  • Most links point to the most important pages
  • There are a variety of resources
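
Two of the points above, click depth and the spread of links between pages, can be checked mechanically once you have a map of a site's internal links. The following Python sketch shows one way to do it, using a breadth-first search from the front page. The site map `SITE` and the function names are hypothetical, invented for this example; in practice the link map would come from crawling your own site.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search: fewest clicks needed to reach each page from `start`."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_beyond(links, start, max_clicks=3):
    """Pages that need more than `max_clicks` clicks to reach.

    Pages that cannot be reached at all do not appear in the result.
    """
    depths = click_depths(links, start)
    return sorted(page for page, depth in depths.items() if depth > max_clicks)


def inbound_counts(links):
    """Number of internal links pointing at each page."""
    counts = {}
    for page, targets in links.items():
        counts.setdefault(page, 0)
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/widgets", "/"],
    "/products/widgets": ["/products/widgets/blue", "/products"],
    "/products/widgets/blue": ["/products/widgets/blue/specs"],
    "/products/widgets/blue/specs": [],
    "/about": ["/"],
}

if __name__ == "__main__":
    print("Too deep:", pages_beyond(SITE, "/"))
    print("Inbound links:", inbound_counts(SITE))
```

In this made-up site the specs page sits four clicks from the front page, so it is the one the check flags; adding a link to it from a higher-level page would bring it within three clicks.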

What is a bad website for spidering ?

A poorly-arranged website will be indexed badly, and will rank low or erratically in the search results. These faults may exist:
  • The site is built in Flash (this type of site is commercially useless since spiders either cannot read the content or do so with difficulty - and there are also many other issues)
  • The site is built partly in Flash (this type of site has major handicaps)
  • Many resources are in Flash
  • The menus are Flash or JavaScript
  • The page code is obsolete, heavily scripted or of poor quality
  • There is an abundance of JavaScript on the site
  • There are no sitemaps
  • Some pages are not linked to - these are called orphan pages
  • Every page has the same number of links pointing to it
  • Some pages take five or more clicks to reach
  • The site has few links
  • All links point to the front page
  • There are only straight web pages on the site, and nothing else
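
Orphan pages are one of the easier faults in this list to detect automatically. As a rough Python sketch: take the full list of pages the site is supposed to have (for example, the entries in a sitemap), compare it with the pages that other pages actually link to, and report the difference. The page list, the link map and the function name `find_orphans` are assumptions made up for this illustration.

```python
def find_orphans(all_pages, links, home="/"):
    """Return pages that no other page links to (the front page is exempt)."""
    linked = set()
    for source, targets in links.items():
        for target in targets:
            if target != source:  # a page linking to itself does not count
                linked.add(target)
    return sorted(p for p in all_pages if p not in linked and p != home)


# Hypothetical data: every page the site contains...
ALL_PAGES = ["/", "/about", "/contact", "/old-offer"]

# ...and the internal links found on each page.
LINKS = {
    "/": ["/about", "/contact"],
    "/about": ["/"],
    "/old-offer": ["/"],
}

if __name__ == "__main__":
    print(find_orphans(ALL_PAGES, LINKS))
```

Here `/old-offer` links out to the front page but nothing links in to it, so a spider that follows links from the front page would never find it.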

Websites therefore need to be built so that search engines can index them easily; this is a commercial necessity. Existing sites can generally be repaired, although it is cheaper to build them correctly in the first place.

What is spider food ?

'Spider food' is the term for useful resources that are not plain web pages, such as images, video, PDF files, and forums.

The name comes from the observation that such alternative resources appear to be favoured by search engine spiders, and are well spidered compared to basic web pages.

This is probably because search engines are looking for useful resources for their customers, and such items are slightly more favoured as there are fewer of them and they may present more useful or popular information. Therefore, a site should include such additional resources where possible.

These can include graphics, PDF files for download, MPEG video, Flash video, charts, tutorials, reviews, forums, blogs, wikis, directories, net resources, photos, images, etc.

How does a search engine work ?

The final part of the question 'what is a spider' is an explanation of how a search engine works, and how it uses the resources a spider finds. Here is a sequence that explains how search engines find a web page, how they index and rate it, and how it appears in search results.

A search engine is a group of computers that may exist at one place, but is more commonly located at many computer centres, often called datacentres. There are research computers, storage computers, spidering computers, and server computers, which work as follows:
  • The spidering computers are tasked with discovering web resources
  • They search the web for new resources, and check existing resources for updates
  • There are millions of web checks running at any one time - or if you like, there are a lot of spiders out there
  • The addresses of all pages and other resources found are saved on the storage computers
  • The pages are listed in the search engine's index if acceptable
  • The pages are ranked and rated according to a complex algorithm that is the exclusive design of the search engine, and possibly its most important property
  • The pages, and the websites they are on, are assigned various ratings according to the algorithm's evaluation
  • A page is assigned a position in the search results for every keyword that the page is relevant to
  • The page's position is continually being reassessed
  • A strong page on a strong site will place well - and both are important factors
  • Many individual factors are assessed in order to assign a position in the search results - the research computers are continually working on the web's resources
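
The indexing and ranking steps in the list above can be illustrated with a toy example in Python. This is a deliberately simplified sketch: it builds a keyword index over a few invented pages and orders results by how often the keyword appears on each page. Real search engines use far more complex, proprietary algorithms, as noted above, so the scoring rule here is only a stand-in, and the page texts, addresses and function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to the pages containing it, with a count per page."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index


def search(index, keyword):
    """Return pages for one keyword, most occurrences first.

    Ties are broken alphabetically by URL so the order is stable.
    """
    postings = index.get(keyword.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


# Hypothetical pages as a spider might have stored them.
PAGES = {
    "https://example.com/": "Garden tools and garden furniture for every garden",
    "https://example.com/tools": "Hand tools, power tools and tools for the garden",
    "https://example.com/about": "About our family business",
}

if __name__ == "__main__":
    index = build_index(PAGES)
    for keyword in ("garden", "tools"):
        print(keyword, "->", search(index, keyword))
```

The point of the sketch is the shape of the process: the index is built ahead of time from what the spiders collected, so answering an enquiry is a quick lookup rather than a fresh search of the web, and each page can hold a different position for each keyword it is relevant to.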

The search-answer sequence goes like this:

- A person who wants to know something opens their browser and enters a search query.
- The browser connects to the search engine's datacentre. It asks a server computer there for information.
- The server asks a storage computer for the listings, and a list of results is delivered to the enquirer.
- The enquirer chooses a result from the list and clicks on it, and is passed along to the web resource chosen.

They may choose a regular listing, also known as an organic result, or they may choose an advertisement, normally of the pay-per-click (PPC) type. Approximately 40% of clickthroughs are said to go to the #1 slot, the first of the organic results.

Probably the most impressive feature of modern search engines, especially the top performers, is their sheer speed. Producing a ranked list of results for any enquiry in a second or so, from billions of resources, is remarkable if not miraculous.

The majority of new visitors to most websites come from search engines. If the site is a good resource, people bookmark it, and return later. Most conversions (orders, sign-ups etc) occur on a second or subsequent visit, not on the first visit.
 
© LP Web Development 2009 - All Rights Reserved