How Search Engines Work (and Sometimes Don’t)

You know how important it is to score high in the SERPs. But your site isn’t reaching the first three pages, and you don’t understand why. It could be that you’re confusing the web crawlers that are trying to index it. How can you find out? Keep reading.

You have a masterful website with lots of relevant content, but it isn’t coming up high in the search engine results pages (SERPs). You know that if your site isn’t on those early pages, searchers probably won’t find you. You can’t understand why you’re apparently invisible to Google and the other major search engines. Your rivals hold higher spots in the SERPs, and their sites aren’t nearly as nice as yours.

Search engines aren’t people. In order to handle the tens of billions of web pages that comprise the World Wide Web, search engine companies have almost completely automated their processes. A software program isn’t going to look at your site with the same “eyes” as a human being. This doesn’t mean that you can’t have a website that is a joy to behold for your visitors. But it does mean that you need to be aware of the ways in which search engines “see” your site differently, and plan around them.

Despite the complexity of the web and the speed at which all that data must be handled, search engines actually perform a short list of operations in order to return relevant results to their users. Each of these four operations can go awry in certain ways. It isn’t so much that the search engine itself has gone awry; it may simply have encountered something that it was not programmed to deal with, or the way it was programmed to deal with what it encountered led to less than desirable results.

Understanding how search engines operate will help you understand what can go wrong. All search engines perform the following four tasks:

  • Web crawling. Search engines send out automated programs, sometimes called “bots” or “spiders,” which use the web’s hyperlink structure to “crawl” its pages. According to some of the best estimates, search engine spiders have crawled perhaps half of the pages that exist on the Internet.

  • Document indexing. After spiders crawl a page, its content needs to be put into a format that makes it easy to retrieve when a user queries the search engine. Thus, pages are stored in a giant, tightly managed database that makes up the search engine’s index. These indexes contain billions of documents, which are delivered to users in mere fractions of a second.

  • Query processing. When a user queries a search engine, which happens hundreds of millions of times each day, the engine examines its index to find documents that match. Queries that look superficially the same can yield very different results. For example, searching for the phrase “field and stream magazine,” without quotes around it, yields more than four million results in Google. Do the same search with the quote marks, and Google returns only 19,600 results. This is just one of many modifiers a searcher can use to give the database a better idea of what should count as a relevant result.

  • Ranking results. Google isn’t going to show you all 19,600 results on the same page; even if it did, it would need some way to decide which ones should show up first. Thus, the search engine runs an algorithm on the results to calculate which ones are most relevant to the query. These are shown first, with all the others following in descending order of relevance.

Now that you have some idea of the processes involved, it’s time to take a closer look at each one. This should help you understand how things go right, and how and why these tasks can go “wrong.” This article will focus on web crawling, while a later article will cover the remaining processes.

You’re probably thinking chiefly of your human visitors when you set up your website’s navigation, as well you should. But certain kinds of navigation structures will trip up spiders, making it less likely for those visitors to find your site in the first place. As an added bonus, many of the things you do to make your content easier for a spider to find will also make your site easier for visitors to navigate.

It’s worth keeping in mind, by the way, that you might not want spiders to be able to index everything on your site. If you own a site with content that users pay a fee to access, you probably don’t want a Google bot to grab that content and show it to anyone who enters the right keywords. There are ways to deliberately block spiders from such content. In keeping with the rest of this article, which is intended mainly as an introduction, they will only be mentioned briefly here.

Dynamic URLs are one of the biggest stumbling blocks for search engine spiders. In particular, pages with two or more dynamic parameters will give a spider fits. You know a dynamic URL when you see it; it usually has a lot of “garbage” in it such as question marks, equal signs, ampersands (&) and percent signs. These pages are great for human users, who usually get to them by setting certain parameters on a page. For example, typing a zip code into a box at weather.com will return a page that describes the weather for a particular area of the US – and a dynamic URL as the page location.
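To make this concrete, here is the sort of URL I mean; the domain, file, and parameter names are made up for illustration:

http://www.example.com/weather/report.php?zip=10001&units=f

Everything after the question mark tells the server which parameters to use when building the page. A single parameter usually won’t stop a spider, but two or more, as above, may well convince it to move on to something simpler.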

There are other ways in which spiders don’t like complexity. For example, a page with more than 100 unique links to other pages on the same site can overwhelm a spider at a glance, and it may not follow each link. If you are trying to build a site map, there are better ways to organize it.

Pages that are buried more than three clicks from your website’s home page also might not be crawled. Spiders don’t like to go that deep. For that matter, many humans can get “lost” on a website with that many levels of links if there isn’t some kind of navigational guidance.

Pages that require a “Session ID” or cookie to enable navigation also might not be spidered. Spiders aren’t browsers, and don’t have the same capabilities. They may not be able to retain these forms of identification.
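To picture the problem, consider a hypothetical URL that carries a session ID as a parameter (the file name and ID are invented for illustration):

http://www.example.com/catalog.php?PHPSESSID=8d3f2a9c41b7

Since a fresh session ID is issued on every visit, a spider that does manage to follow such links may see the same page over and over under apparently “different” URLs, which does your rankings no favors.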

Another stumbling block for spiders is pages that are split into “frames.” Many web designers like frames; it allows them to keep page navigation in one place even when a user scrolls through content. But spiders find pages with frames confusing. To them, content is content, and they have no way of knowing which pages should go in the search results. Frankly, many users don’t like pages with frames either; rather than providing a cleaner interface, such pages often look cluttered.
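To see why spiders get confused, here is a bare-bones sketch of a framed page; the file names are hypothetical:

<!-- hypothetical file names, for illustration only -->
<frameset cols="20%,80%">
  <frame src="navigation.html">
  <frame src="content.html">
  <noframes>
    <a href="sitemap.html">Site map</a>
  </noframes>
</frameset>

The spider finds two separate documents, navigation.html and content.html, with nothing to tell it that they belong together on one screen. If you must use frames, a <noframes> section containing real links and content at least gives spiders (and frame-less browsers) something to work with.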

Most of the stumbling blocks above are ones you may have accidentally put in the way of spiders. This next set of stumbling blocks includes some that website owners might use on purpose to block a search engine spider. While I mentioned one of the most obvious reasons for blocking a spider above (content that users must pay to see), there are certainly others: the content itself might be free, but should not be easily available to everyone, for example.

Pages that can be accessed only after filling out a form and hitting “Submit” might as well be closed doors to spiders. Think of them as not being able to push buttons or type. Likewise, pages that require use of a drop down menu to access might not be spidered, and the same holds true for documents that can only be accessed via a search box.
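To illustrate, compare a drop-down menu that depends on a script with the same destinations offered as plain links; the page names are invented for this sketch:

<!-- Hypothetical pages. A spider has no JavaScript and no "hands" to work this menu -->
<select onchange="window.location=this.value;">
  <option value="products.html">Products</option>
  <option value="support.html">Support</option>
</select>

<!-- The same destinations as ordinary links, which a spider can follow -->
<a href="products.html">Products</a>
<a href="support.html">Support</a>

You don’t have to give up the menu for your human visitors; just make sure every page it leads to is also reachable through a plain HTML link somewhere on the site.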

Documents that are purposefully blocked will usually not be spidered. This can be handled with a robots meta tag or robots.txt file. You can find other articles that discuss the robots.txt file on SEO Chat.
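Briefly, then, here is roughly what each method looks like. A robots.txt file sits at the root of your site; the directory name here is hypothetical:

User-agent: *
Disallow: /private/

A robots meta tag goes in the <head> of an individual page:

<meta name="robots" content="noindex, nofollow">

The first tells all well-behaved spiders to stay out of anything under /private/; the second tells them not to index that particular page or follow its links. Note that these are requests, not locks; a rogue spider is free to ignore them.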

Pages that require a login block search engine spiders. Remember the “spiders can’t type” observation above. Just how are they going to log in to get to the page?

Finally, I’d like to make a special note of pages that redirect before showing content. Not only will that not get your page indexed, it could get your site banned. Search engines refer to this tactic as “cloaking” or “bait-and-switch.” You can check Google’s guidelines for webmasters (http://www.google.com/intl/en/webmasters/guidelines.html) if you have any questions about what is considered legitimate and what isn’t.

Now that you know what will make spiders choke, how do you encourage them to go where you want them to? The key is to provide direct HTML links to each page you want the spiders to visit. Also, give them a shallow pool to play in. Spiders usually start on your home page; if any part of your site cannot be accessed from there, chances are the spider won’t see it. This is where use of a site map can be invaluable.
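A site map doesn’t have to be fancy. A minimal sketch, with hypothetical page names, is just a page of direct HTML links:

<!-- hypothetical pages, for illustration -->
<a href="about.html">About Us</a>
<a href="products.html">Products</a>
<a href="services.html">Services</a>
<a href="contact.html">Contact Us</a>

Link to this page from your home page, and every page it lists is suddenly only two clicks deep. Just remember the 100-link limit mentioned earlier, and split the map across several pages if your site is large.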

I’ll assume that you are all reasonably familiar with HTML. If you have ever looked at the source code for an HTML page, you probably noticed text like this wherever a hyperlink appeared:

<a href="http://www.seochat.com">SEO Chat</a>

When a web browser reads this, it knows that the text “SEO Chat” should be hyperlinked to the web page http://www.seochat.com. Incidentally, “SEO Chat” in this case is the “anchor text” of the link. When a spider reads this text, it thinks, “Okay, the page http://www.seochat.com is relevant to the text on this page, and very relevant to the term ‘SEO Chat.’”

Let’s get a little more complicated.

<a href="http://www.seochat.com" title="Great Site for SEO Info" rel="nofollow">SEO Chat</a>

Now what? The anchor text hasn’t changed, so the link will still look the same when the web browser displays it. But a spider will think, “Okay, not only is this page relevant to the term ‘SEO Chat,’ it is also relevant to the phrase ‘Great Site for SEO Info.’ And hey, there’s a relationship between the page I’m crawling now and this hyperlink! It says that this link doesn’t count as a ‘vote’ for the page being linked to. Okay, so it won’t add to the page rank.”

That last point, about the link not counting as a vote for the page being linked to, is what the rel="nofollow" attribute does. It evolved to address the problem of people submitting linked comments to blogs that said things like “Visit my pharmaceuticals site!” That kind of comment is an attempt by the commenter to raise his own website’s position in the search engine rankings. It’s called comment spam, by the way; most major search engines don’t like comment spam because it skews their results, making them less relevant. As you may have guessed, then, the “nofollow” value in the “rel” attribute is specifically for search engines; it really isn’t there to be noticed by anyone else. Yahoo!, MSN, and Google recognize it, but AskJeeves does not; its crawler simply ignores the nofollow attribute.

In some cases, a link may be assigned to an image. The hyperlink would then include the name of the image, and might include some alternate text in an “alt” attribute, which can be helpful for voice-based browsers used by the blind. It also helps spiders, because it gives them another clue for what the page is about.
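Such a link might look like this; the image file name and alt text are invented for illustration:

<!-- hypothetical image file -->
<a href="http://www.seochat.com"><img src="seochat-logo.gif" alt="SEO Chat search engine optimization forums"></a>

The spider cannot “see” the logo, but the alt text tells it, much like anchor text would, what the page on the other end of the link is about.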

Hyperlinks may take other forms on the web, but by and large those forms do not pass ranking or spidering value. In general, the closer a link is to the classic <a href="URL">text</a>, the easier it is for a spider to follow it; the further a link strays from that form, the less likely a spider is to follow it at all.
Article by SEO CHAT
