I’ve heard people wonder what sort of artificial intelligence or biological system is involved in Google. Web searches are really quite mechanical. Here’s an overview of what really goes on within Google.
(If you like this sort of thing, see my backgrounder on Bayesian Filtering of Spam.)
There are only two things that matter in a Web search:
1) What is the set of pages included in the results?
2) How are they ordered?
1: Google first strips out words that some engineer decided are too common to be relevant. Then it creates a result set that includes every single page it knows of that mentions each and every one of the remaining words in your query. The words can be in the text, the title, the URL, and I’m not sure where else, but they have to be there.
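To show how mechanical this step is, here is a minimal sketch of it in Python. Everything in it is my own illustration, not Google's code: the stop-word list, the page ids, and the function names (`tokenize`, `build_index`, `result_set`) are all made up, and it only looks at page text rather than titles or URLs.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the real one is whatever "some engineer decided".
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of page ids that mention it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def result_set(index, query):
    """Return the pages that mention every non-stop word in the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    # Intersect the page sets: a page must appear under every remaining word.
    return set.intersection(*(index.get(w, set()) for w in words))

# Made-up pages for illustration only.
PAGES = {
    "a.example/steam": "The history of the steam engine governor",
    "b.example/watt": "James Watt and the steam engine",
    "c.example/spam": "Bayesian filtering of spam",
}
INDEX = build_index(PAGES)
print(sorted(result_set(INDEX, "the steam engine")))
```

The whole thing is a table lookup followed by a set intersection. Note that lowercasing everything up front is also why there is no case sensitivity to control.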
Google became famous while only allowing an exact match for each word. Now the result set is larger by allowing alternatives explicitly (word choices separated by OR), by stemming (under Google engineers’ best judgment, in a Bose/Disney sort of way, but you can turn it off by surrounding the word or phrase with double-quotes), and by synonyms (precede the word with a tilde).
The result set can also be narrowed by requiring, say, that a word be mentioned in the URL, or that the page be in English. “Unimportant” words can be added back into the requirements by preceding them with a +. Pages can be omitted if they mention a word preceded by -. There’s also some filtering based on their idea (or the Chinese government’s idea) of “bad”.
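The OR, + and - operators are just as mechanical. Here is a rough sketch, assuming a simplified query syntax of my own: the names `parse_query`, `matches`, and `search` are hypothetical, and stemming, synonyms, quoted phrases, and URL or language restrictions are left out entirely.

```python
# Hypothetical stop-word list, as in any toy search engine.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def parse_query(query):
    """Split a query into required groups (each a set of OR alternatives)
    and a set of excluded words."""
    required, excluded = [], set()
    tokens = query.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-") and len(tok) > 1:
            excluded.add(tok[1:].lower())          # -word: page must not mention it
        elif tok.startswith("+") and len(tok) > 1:
            required.append({tok[1:].lower()})     # +word: keep even if "unimportant"
        elif tok == "OR" and required and i + 1 < len(tokens):
            i += 1                                 # a OR b: add b to the previous group
            required[-1].add(tokens[i].lower())
        elif tok.lower() not in STOP_WORDS:
            required.append({tok.lower()})         # plain word: required
        i += 1
    return required, excluded

def matches(page_words, required, excluded):
    """A page matches if it has a word from every group and no excluded word."""
    return (all(group & page_words for group in required)
            and not (excluded & page_words))

def search(pages, query):
    """Return the ids of all pages that satisfy the parsed query."""
    required, excluded = parse_query(query)
    if not required:
        return set()
    return {pid for pid, text in pages.items()
            if matches(set(text.lower().split()), required, excluded)}

# Made-up pages for illustration only.
PAGES = {
    "a.example": "the steam engine governor",
    "b.example": "the diesel engine",
    "c.example": "the steam locomotive",
}
print(sorted(search(PAGES, "steam OR diesel engine")))
print(sorted(search(PAGES, "engine -diesel")))
print(sorted(search(PAGES, "+the locomotive")))
```

One shortcut to be aware of: an OR always attaches to the most recently required word, so a query like "the OR steam" would behave oddly here. A real parser would be fussier, but it would be no more intelligent.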
(I find it illustrative that every engineer thinks that surely, people must need control over case sensitivity. Google discovered this is a useless feature and does not clutter up the program with such nonsense.)
2: Google ranks pages primarily by two factors:
* Pages in “better” sites are listed earlier, where “better” is determined by how many other “better” sites link to the one being considered.
* Relevance to your subject is computed by how the words are grouped together (phrases in the page matching to a greater or lesser degree with phrases in your query) and where they appear (in titles, bold, etc.).
Other search engines use the latter, but I think Google has some intellectual property protection on the first (“PageRank”) technique.
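The circular definition of “better” (a site is better if better sites link to it) sounds mystical, but it can be computed by plain repetition. Below is a sketch of the simplified textbook form of the idea, not Google's production code. The damping factor of 0.85, the iteration count, the way pages with no outgoing links are handled, and the tiny link graph are all assumptions of mine.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Estimate a rank for every page.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally "good".
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share no matter who links to it.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Made-up link graph: three pages link to "hub", and "hub" links to two of them.
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "loner": ["hub"],
}
RANKS = page_rank(LINKS)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 3))
```

In this toy graph “hub” comes out on top because everything links to it, and “loner” comes out last because nothing does. Nobody told the program that; it falls out of repeating the same arithmetic until the numbers settle.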
All this is completely algorithmic. John has written about the connection between AI and art. OK, imagine an artist, who, through hard work, experimentation, and taste, discovers a process that allows him to reliably and cost-effectively reproduce an admired original work. Let’s say most of the artifacts are considered “pretty nice” by most people. Regardless of whatever intuition the artist used in creating the process, the cranking out of artifacts is now mostly a work of craftsmanship, not originality. That’s Google. The cleverness is in the engineers coming up with something that works darn well in most cases and executes efficiently and reliably. Now it’s just an algorithm.
If the resulting process bears any resemblance to what a librarian might do in a manual search, it’s just an accident. What the engineers set out to do was produce pretty good results on a particular set of hardware. There’s no magic.
Here’s another analogy. A flyball governor is one of those Victorian-looking scissors-action deelies ending in two big brass balls that spin around. It’s hooked up in such a way that if a shaft on some steam engine or something starts spinning too fast, the governor spins in such a way that it will lessen the amount of fuel or air or something for the fire. (The expanding scissors action at higher speed pulls a control rod up.) Google’s PageRank and some of its other algorithms are also pretty darn dynamic and self-adjusting, but they’re no more biological than a governor. To my mind, coming up with the Google idea and making it work is as much a thing of beauty as James Watt’s creation of these brass ball thingies. But the balls themselves aren’t being smart in the work they do. They’re not even savants within the limited domain of “speed regulation.”
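For the curious, the self-adjusting part of a governor fits in a dozen lines. This is a toy negative feedback loop, not a physical model: the engine response, the gain, and the name `run_governor` are invented for illustration. A real flyball governor sets the throttle directly from the height of the balls, whereas this sketch just nudges the throttle a little each step, which keeps the arithmetic simple.

```python
def run_governor(target_speed, gain=0.05, steps=200):
    """Simulate a toy engine whose throttle is trimmed by a speed error.

    Returns the list of speeds, one per step.
    """
    speed, throttle = 0.0, 1.0   # start stopped, throttle wide open
    history = []
    for _ in range(steps):
        # Toy engine: speed drifts toward 100 * throttle.
        speed += 0.2 * (100.0 * throttle - speed)
        # Governor: too fast closes the throttle a bit, too slow opens it.
        error = speed - target_speed
        throttle = min(1.0, max(0.0, throttle - gain * error / 100.0))
        history.append(speed)
    return history

speeds = run_governor(target_speed=60.0)
print(round(max(speeds), 1), round(speeds[-1], 1))
```

The speed overshoots at first and then settles near the target, with no understanding of engines anywhere in the loop. It is just subtraction applied over and over.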