3.1. How Your Site Appears to a Bot

To state the obvious: before your site can be indexed by a search engine, it has to be found by that search engine. Search engines find web sites and web pages using software that follows links to crawl the Web. This kind of software is variously called a crawler, a spider, a search bot, or simply a bot (bot is a diminutive of "robot").
To be found quickly by a search engine bot, it helps to have inbound links to your site. More important, the links within your site should work properly: if a bot encounters a broken link, it cannot reach, or index, the page the broken link points to.

3.1.1. Images

Pictures mean nothing to a search bot. The only information a bot can gather about a picture comes from the alt attribute within the picture's <img> tag and from the text surrounding the picture. Therefore, always provide descriptive information in the alt attribute of your images, and provide at least one text-only link (that is, a link outside of an image map) to every page on your site.

3.1.2. Links

Some kinds of links to pages (and sites) simply cannot be traversed by a search engine bot. The most significant issue is that a bot cannot log in to your site. If a site or page requires a username and a password for access, it probably will not be included in a search index.
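The two points above, the alt text a bot can read and the link targets it can follow, can be sketched with Python's standard html.parser module. This is only a rough illustration of what a bot extracts from markup, not any search engine's actual crawler; the sample page is invented for the example.

```python
from html.parser import HTMLParser

class BotView(HTMLParser):
    """Collects what a search bot can see in a page: link targets and image alt text."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.alts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img":
            # An image with no alt text contributes nothing the bot can index.
            self.alts.append(attrs.get("alt", ""))

page = """
<a href="/gallery.html"><img src="sunset.jpg" alt="Sunset over the bay"></a>
<img src="spacer.gif">
<a href="/contact.html">Contact</a>
"""
viewer = BotView()
viewer.feed(page)
print(viewer.links)  # ['/gallery.html', '/contact.html']
print(viewer.alts)   # ['Sunset over the bay', '']
```

Note that the second image yields an empty string: from the bot's point of view, it simply isn't there.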
Complex URLs that involve a script can also confuse a bot (although only the most complex dynamic URLs are absolutely nonnavigable). You can generally recognize this kind of URL because a ? follows the script name. Here's an example: http://www.digitalfieldguide.com/resources.php?set=313312&page=2&topic=Colophon. Pages reached with this kind of URL are dynamic, meaning that the content of the page varies depending upon the values of the parameters passed to the script that generates the page (the name of the script comes before the ? in the URL). In this example, the parameters are passed to the resources.php script as name=value pairs separated by ampersands (&). If the topic parameter were changed, for example, to topic=Equipment using the URL http://www.digitalfieldguide.com/resources.php?set=313312&page=2&topic=Equipment, a page with different content would open.
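The anatomy of such a dynamic URL can be seen by taking the example apart with Python's standard urllib.parse module, shown here purely to illustrate the script-name and name=value structure a bot has to cope with:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.digitalfieldguide.com/resources.php?set=313312&page=2&topic=Colophon"
parsed = urlparse(url)

# The script that generates the page comes before the ?
print(parsed.path)  # /resources.php

# The query string after the ? decomposes into name=value pairs
params = parse_qs(parsed.query)
print(params)  # {'set': ['313312'], 'page': ['2'], 'topic': ['Colophon']}
```

Each additional parameter lengthens the query string, which is exactly why fewer parameters make a dynamic URL easier for a bot to traverse.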
Dynamic pages opened by scripts that are passed values are too useful to avoid. Most search engine bots can traverse dynamic URLs provided the URLs are not too complicated. But you should be aware that dynamic URLs are a potential issue for some search engine bots, and try to keep these URLs as simple as possible, using as few parameters as you can.

3.1.3. File Formats

Most search engines, and search engine bots, are capable of parsing and indexing many different kinds of file formats. For example, Google states that "We are able to index most types of pages and files with very few exceptions. File types we are able to index include: pdf, asp, jsp, html, shtml, xml, cfm, doc, xls, ppt, rtf, wks, lwp, wri, swf." However, simpler is often better. To get the best search engine placement, you are well advised to keep your web pages, as they are actually opened in a browser, to straight HTML. Note a couple of related issues:
Figure 3-1. Server-side scripts and includes serve HTML pages to a browser

Google puts the "simple is best" precept this way: "If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." The only way to know for sure whether a bot will be able to crawl your site is to check the site yourself using an all-text browser.

3.1.4. Viewing Your Site with an All-Text Browser

Improvement implies a feedback loop: you can't know how well you are doing without a mechanism for examining your current status. The feedback mechanism that helps you improve your site from an SEO perspective is to view it as the bot sees it, which means viewing the site using a text-only browser. A text-only browser, just like a search engine bot, ignores images and graphics and processes only the text on a page. The best-known text-only web browser is Lynx; you can find more information about it at http://lynx.isc.org/. Generally, installing Lynx involves downloading source code and compiling it.
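If you just want a quick feel for the tag-stripping a text-only view performs, you can approximate it with a few lines of Python. This is a crude sketch, not a substitute for Lynx itself (it does no layout, link numbering, or rendering), and the sample page is invented for the example:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Crude approximation of a text-only browser: keeps text, drops tags and images."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only the text between tags survives; images and markup vanish.
        if data.strip():
            self.chunks.append(data.strip())

page = '<h1>Photoblog</h1><img src="banner.jpg"><p>Latest <b>photos</b> below.</p>'
t = TextOnly()
t.feed(page)
print(" ".join(t.chunks))  # Photoblog Latest photos below.
```

Everything the banner image might have communicated is gone, which is precisely the point of checking your pages this way.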
Don't want to compile source code or figure out which idiosyncratic Lynx build to download? There is a simple Lynx Viewer available on the Web at http://www.delorie.com/web/lynxview.html. First, open the Lynx Viewer web page. Next, follow the directions to make sure that a file named delorie.htm is saved in the root directory of your web site. To do this, you'll need either FTP access to upload a file to your web server or the ability to create an empty page on your site.
Finally, simply enter your URL and see what your site looks like in a text-only version. Figure 3-2 shows the text-only version of Photoblog 2.0. It's certainly easier to see the text that the search bot sees when you are not distracted by the "eye candy" of the full image version (Figure 3-3).

Figure 3-2. Lynx Viewer makes it easy to focus on text and links without the distraction of the image-rich version (Figure 3-3)