toggle.org, TOGGLE On-line Newsletter

Number 210 - November 2000

Search Engines
by John Buck, Stanford Palo Alto UG [This is a summary of Michael Specter's article, "Search and Deploy," from the 05/29/00 New Yorker. I recommend you read the original for more info. Scientific American's July 1999 issue also had an article on the subject.]
Most search engines use spiders, which crawl through the Web picking up every link on every page, without considering relevancy. "It's ironic, but the bigger the Internet gets, the more difficult it is to find a simple, accurate answer to your questions," said Lawrence Page, who founded Google with Sergey Brin. (Google derives from the word "googol"--the number 1 followed by 100 zeroes.) Even the largest search engine, Inktomi, has indexed only about half the Web. Yahoo! isn't really a search engine at all; it has a team of editors who index the Internet. If you want a page to show up on a Yahoo! search, you must submit a form with information about the site.

One of the biggest problems in retrieving the information you want is "the verbal disagreement problem." Verbal disagreement means that you and someone else may not use the same words or phrases to describe what you're looking for or what you've found. A search for "automobile" may miss pages that use "car."

Different search engines or indexes use different methods for determining which pages appear at the top of a listing. Some page makers repeat words many times on a page, in invisible type, so that a search engine rates the page as more relevant than it otherwise would. Some searchers or indexers accept money (x cents per click) from page owners, which encourages them to put certain pages higher in their listings than they otherwise would.

What Google's founders saw, apparently before others, was that hyperlinks could be used to assign values to Web pages--the more links, the more value. They go even further by assigning values based on the pages where those links are found. A link on the New York Times' home page carries more value than a link from a personal home page might, because more pages link to the Times' page than to the personal one.
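
The link-weighting idea described above can be illustrated with a small sketch. This is not Google's actual code; it is a simplified iteration in which each page passes a share of its own value to the pages it links to. The function name, the damping value of 0.85, the iteration count, and the page names are all assumptions made for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by their incoming links, weighted by the linker's score.

    links maps each page name to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small base value.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its value evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its value over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical mini-Web: three pages link to "times", none link to "personal".
example_links = {
    "blog1": ["times"],
    "blog2": ["times"],
    "blog3": ["times"],
    "times": ["shop_a"],
    "personal": ["shop_b"],
    "shop_a": [],
    "shop_b": [],
}
scores = link_rank(example_links)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy graph "shop_a" and "shop_b" each have exactly one incoming link, but "shop_a" ends up with the higher score because its one link comes from the heavily linked "times" page.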
The Clever Project at IBM's Almaden Research Center looks at links much as Google does, but can assign them different values, depending on the search request. When it finds a page filled with useful links on a subject, it calls it a "hub" page. "Then, unlike Google, it analyzes the hubs to discover 'authorities'--pages that online experts in [a particular subject] regard as the most useful and interesting--and uses the authorities to judge the quality of the hubs. Emerging from all that is ...[what IBM's Andrew Tomkins calls] 'the footprint of a community. The surprising thing is that as the number of pages grows, the number of communities shrinks. ...This is a way to understand the emergence of ...patterns ...trends, ideas, communities. It could be beyond search. It could give people what they are looking for.'"

Because there doesn't seem to be much money in simply searching, many search engines and Web indexers have become portals, which encourage other uses and accept advertising. The article ends with this, from Google's Page: "The great thing about search is that we are not going to solve it any time soon. ...I see no end to what we need to do. If we aren't a lot better next year, we will already be forgotten."
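
The hub-and-authority scoring attributed to the Clever Project above can be sketched in the same spirit. This is a generic mutual-reinforcement iteration, not IBM's implementation: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. The function name, iteration count, and page names are assumptions made for illustration.

```python
import math


def hubs_and_authorities(links, iterations=50):
    """Return (hub, authority) scores for a small link graph.

    links maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                auth[t] += hub[page]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale both sets of scores so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical pages on one subject: two link-list "guides" and one stray page.
example_links = {
    "guide_a": ["site_x", "site_y"],
    "guide_b": ["site_x", "site_y", "site_z"],
    "loner": ["site_z"],
}
hub, auth = hubs_and_authorities(example_links)
print("best hub:", max(hub, key=hub.get))
print("authorities:", sorted(auth, key=auth.get, reverse=True)[:3])
```

Because the two scores feed each other, "site_x" and "site_y" outrank "site_z" as authorities even though "site_z" also has two incoming links: one of its links comes from "loner", which is a weak hub.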