Section 3.1. How Your Site Appears to a Bot | Google Advertising Tools: Cashing in with Adsense, Adwords, and the Google APIs

3.1. How Your Site Appears to a Bot

To state the obvious, before your site can be indexed by a search engine, it has to be found by the search engine. Search engines find web sites and web pages using software that follows links to crawl the Web. This kind of software is variously called a crawler, a spider, a search bot, or simply a bot (bot is a diminutive for "robot").

You may be able to short-circuit the process of waiting to be found by the search engine's bot by submitting your URL directly to search engines, as explained in Chapter 2.

To be found quickly by a search engine bot, it helps to have inbound links to your site. More important, the links within your site should work properly. If a bot encounters a broken link, it cannot reach, or index, the page pointed to by the broken link.
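One way to catch broken internal links before a bot does is to check every link against the set of pages you know exist. The following is a minimal sketch rather than a complete link checker: it assumes you already have each page's HTML in memory, the example.com URLs are placeholders, and the names LinkCollector and find_broken_links are made up for this illustration. It skips links to other hosts and does not fetch anything over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_broken_links(pages):
    """Return (source_url, target_url) pairs whose target is not a known page.

    `pages` maps each page URL to its HTML source (a hypothetical in-memory
    snapshot of the site). Links to other hosts are ignored.
    """
    broken = []
    for url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(url, href))
            if urlparse(target).netloc != urlparse(url).netloc:
                continue  # external link: outside the scope of this sketch
            if target not in pages:
                broken.append((url, target))
    return broken


# Hypothetical two-page site used only for illustration.
site = {
    "http://example.com/": (
        '<a href="about.html">About</a> <a href="missing.html">Gone</a>'
    ),
    "http://example.com/about.html": '<a href="/">Home</a>',
}
broken_links = find_broken_links(site)
print(broken_links)
```

In this sample, the link to missing.html is reported because no page with that URL is in the snapshot.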

3.1.1. Images

Pictures don't mean anything to a search bot. The only information a bot can gather about pictures comes from the alt attribute used within a picture's <img> tag and from text surrounding the picture. Therefore, always take care to provide descriptive information via the alt attribute along with your images. Also provide at least one text-only link (for example, outside of an image map) to every page on your site.
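A quick audit for images that lack alt text might look like the sketch below. It assumes the page's HTML is available as a string; the class and function names (AltAuditor, images_missing_alt) and the sample markup are invented for this example.

```python
from html.parser import HTMLParser


class AltAuditor(HTMLParser):
    """Record the src of every <img> whose alt attribute is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            alt = attributes.get("alt")
            if alt is None or not alt.strip():
                self.missing_alt.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images a bot could learn nothing about."""
    auditor = AltAuditor()
    auditor.feed(html)
    return auditor.missing_alt


# Hypothetical markup: one described image, two undescribed ones.
sample = (
    '<img src="bridge.jpg" alt="Golden Gate Bridge at dusk">'
    '<img src="spacer.gif">'
    '<img src="logo.png" alt="">'
)
undescribed = images_missing_alt(sample)
print(undescribed)
```

Note that an empty alt attribute is sometimes used deliberately for purely decorative images, so treat the output as a list to review rather than a list of errors.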

3.1.2. Links

Some kinds of links to pages (and sites) simply cannot be traversed by a search engine bot. The most significant issue is that a bot cannot log in to your site. So if a site or page requires a username and a password for access, then it probably will not be included in a search index.

Don't be fooled by seamless page navigation using such techniques as cookies or session identifiers. If an initial login was required, then these pages probably cannot be accessed by a bot.

Complex URLs that involve a script can also confuse the bot (although only the most complex dynamic URLs are absolutely nonnavigable). You can generally recognize this kind of URL because a ? is included following the script name. Here's an example: http://www.digitalfieldguide.com/resources.php?set=313312&page=2&topic=Colophon. Pages reached with this kind of URL are dynamic, meaning that the content of the page varies depending upon the values of the parameters passed to the script that generates the page (the name of the script comes before the ? in the URL). In this example URL, the parameters are passed to the resources.php script as name=value pairs separated by ampersands (&). If the topic parameter were changed (for example, to topic=Equipment, using the URL http://www.digitalfieldguide.com/resources.php?set=313312&page=2&topic=Equipment), a page with different content would open.

You can try this example by comparing the two URLs to see for yourself the difference a changed parameter makes!
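To make the anatomy of such a URL concrete, the sketch below uses Python's standard urllib.parse module to split the example URL into its script path and its name=value pairs, and then rebuilds the URL with a different topic value. It only manipulates strings; nothing is requested from the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

url = ("http://www.digitalfieldguide.com/resources.php"
       "?set=313312&page=2&topic=Colophon")

parts = urlsplit(url)
script = parts.path                   # the script name comes before the ?
params = dict(parse_qsl(parts.query)) # the name=value pairs after the ?

print(script)   # /resources.php
print(params)   # {'set': '313312', 'page': '2', 'topic': 'Colophon'}

# Change one parameter and rebuild the URL.
params["topic"] = "Equipment"
new_url = urlunsplit(parts._replace(query=urlencode(params)))
print(new_url)
```

The rebuilt URL is the second URL quoted above, with topic=Equipment in place of topic=Colophon.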

Dynamic pages opened using scripts that are passed values are too useful to avoid. Most search engine bots can traverse dynamic URLs provided they are not too complicated. But you should be aware of dynamic URLs as a potential issue with some search engine bots, and try to keep these URLs as simple as possible, using as few parameters as you can.
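If you want a rough way to spot URLs on your site that may be worth simplifying, you could count the parameters in each one. The sketch below does that; note that the cutoff of two parameters is an arbitrary assumption chosen for illustration, not a limit published by any search engine, and the function names and the example.com URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def count_query_parameters(url):
    """Return the number of name=value pairs after the ? in a URL."""
    return len(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def flag_complex_urls(urls, max_params=2):
    """Return the URLs with more than `max_params` parameters.

    The default threshold is an assumption made for this example only.
    """
    return [u for u in urls if count_query_parameters(u) > max_params]


candidates = [
    "http://www.digitalfieldguide.com/resources.php?set=313312&page=2&topic=Colophon",
    "http://example.com/resources.php?topic=Colophon",
    "http://example.com/about.html",
]
flagged = flag_complex_urls(candidates)
print(flagged)
```

With this threshold, only the first URL (three parameters) is flagged for review.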

3.1.3. File Formats

Most search engines, and search engine bots, are capable of parsing and indexing many different kinds of file formats. For example, Google states that "We are able to index most types of pages and files with very few exceptions. File types we are able to index include: pdf, asp, jsp, html, shtml, xml, cfm, doc, xls, ppt, rtf, wks, lwp, wri, swf."

However, simple is often better. To get the best search engine placement, you are well advised to keep your web pages, as they are actually opened in a browser, to straight HTML. Note a couple of related issues:

A file with a suffix other than .htm or .html can contain straight HTML. For example, generated .asp, .cfm, .php, and .shtml files often consist of straight HTML.
Scripts (or include files) running on your web server usually generate HTML pages that are returned to the browser. This architecture is shown in Figure 3-1. An important implication: check the source file as shown in a browser rather than the script file used to generate a dynamic page to see what the search engine will index.

Figure 3-1. Server-side scripts and includes serve HTML pages to a browser

Google puts the "simple is best" precept this way: "If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." The only way to know for sure whether a bot will be unable to crawl your site is to check your site using an all-text browser.

3.1.4. Viewing Your Site with an All-Text Browser

Improvement implies a feedback loop: you can't know how well you are doing without a mechanism for examining your current status. The feedback mechanism that helps you improve your site from an SEO perspective is to view it as the bot sees it. This means viewing the site using a text-only browser. A text-only browser, just like the search engine bot, will ignore images and graphics and only process the text on a page.
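As a rough approximation of that text-only view, the sketch below strips a page down to its visible text, image alt text, and link targets. It is not how Lynx or any particular search bot works internally; it is a simplified illustration built on Python's standard HTML parser, and the TextOnlyView class, the text_only function, and the sample page are all invented for this example.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Keep roughly what a text-only reader sees: text, alt text, and links."""

    SKIPPED = {"script", "style"}  # contents of these tags are not page text

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("alt"):
            # An image contributes only its alt text.
            self.text.append("[" + attributes["alt"] + "]")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text.append(data.strip())


def text_only(html):
    """Return (visible_text, link_targets) for an HTML string."""
    view = TextOnlyView()
    view.feed(html)
    view.close()
    return " ".join(view.text), view.links


# Hypothetical page used only for illustration.
sample = (
    "<html><head><title>Photoblog</title><style>p {color: red}</style></head>"
    '<body><img src="a.jpg" alt="Sunset"><img src="b.jpg">'
    '<p>Welcome to the <a href="/archive">archive</a>.</p>'
    "<script>var x = 1;</script></body></html>"
)
visible_text, link_targets = text_only(sample)
print(visible_text)
print(link_targets)
```

In the sample, the image without alt text contributes nothing to the output, which is the point of the exercise: anything that does not survive this reduction is unlikely to help a bot understand the page.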

The best-known text-only web browser is Lynx. You can find more information about Lynx at http://lynx.isc.org/. Generally, the process of installing Lynx involves downloading source code and compiling it.

The Lynx site also provides links to a variety of precompiled Lynx builds you can download.

Don't want to get into compiling source code or figuring out which idiosyncratic Lynx build to download? There is a simple Lynx Viewer available on the Web at http://www.delorie.com/web/lynxview.html.

First open the Lynx Viewer web page. Next, you'll need to follow the directions to make sure that a file named delorie.htm is saved in the root directory of your web site. To do this, you'll either need FTP access to upload a file to your web server, or the ability to create an empty page on your site.

It doesn't matter what's in this file. Its sole purpose is to make sure you own or control the site you are testing.

Finally, simply enter your URL, and see what your site looks like in a text-only version. Figure 3-2 shows the text-only version of Photoblog 2.0. It's certainly easier to see the text that the search bot sees when you are not distracted by the "eye candy" of the full image version (Figure 3-3).

Figure 3-2. Lynx Viewer makes it easy to focus on text and links without the distraction of the image-rich version (Figure 3-3)