From Jim Gilliam's blog archives

Search: By the People, For the People
April 17, 2003 1:08 PM
LookSmart is adopting Grub's Distributed Crawling Project. Anyone can sign up and become a crawler for Grub, similar to the SETI@home project. Their hope is to get enough clients to index the entire web every day.
The big problem with this project is the bandwidth needed. With a non-distributed crawler, the pages have to be downloaded to the main servers just once. With a distributed crawler, they have to be downloaded at the client, and then uploaded to the server. Uploading to the server from the client is the same as having the server download the page in the first place. So the work is doubled. While it's possible to reduce the size of the data uploaded to the central server by parsing the web pages, to build an effective search engine you need all the data, so the client can't reduce the size much. For example, Google keeps the entire page intact.
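To put rough numbers on that doubling, here's a back-of-the-envelope sketch. The page count and average page size are made-up assumptions for illustration, not Grub's or Google's real figures:

```python
# Back-of-the-envelope bandwidth comparison (all figures are assumptions).
PAGES = 3_000_000_000          # hypothetical size of the web, in pages
AVG_PAGE_KB = 10               # hypothetical average page size, in KB

def terabytes(kb):
    return kb / 1024 ** 3

# Centralized crawler: each page crosses the wire once (server downloads it).
central_kb = PAGES * AVG_PAGE_KB

# Distributed crawler: the client downloads the page, then uploads it to the
# central server -- every byte crosses the wire twice.
distributed_kb = 2 * PAGES * AVG_PAGE_KB

print(f"centralized: {terabytes(central_kb):,.1f} TB")
print(f"distributed: {terabytes(distributed_kb):,.1f} TB")
```

Whatever the real numbers are, the ratio is the point: unless the clients can shrink what they send back, the distributed design moves every page over the network twice.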
This shortcoming is purely a result of the data being stored centrally. What if the data were left on the client computers and the searches themselves were distributed, à la Gnutella? Obviously, the issue then becomes scaling the search requests, but that's already being addressed in the P2P community.
Wait, doesn't this sound like InfraSearch? Remember that project from a few years ago? Two years ago it was folded into Sun's JXTA project, a general-purpose distributed computing platform, and it lives on there as JXTA Search.
Well, it's not quite the same. JXTA Search relies on the content provider implementing its software and exposing its data to the search network. That's clever in the sense that any data can now be searched as long as there's an adapter for the provider's data source; it doesn't need to be a webpage. But there's a big problem: the search results are left up to the content provider. As anyone who has worked on a major search engine will tell you, getting spammed by content providers is one of the biggest obstacles to relevant results. Google will ban anyone caught spamming its index, and it polices this regularly.
Here's an idea for a hybrid of the two ideas. It doesn't replace Google, at least not any time soon, but provides a new kind of search experience. Every time a person loads a webpage on their computer, they've crawled it. Why not just take that webpage and add it to a local index on their computer? A little IE or Mozilla plugin would do the trick. Google already has a distributed computing project that loads into their toolbar.
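A minimal sketch of what such a plugin's local index could look like, in Python. The class and method names here are my own invention for illustration, not any real plugin's API:

```python
import re
from collections import defaultdict

class LocalIndex:
    """Toy inverted index over pages the browser has already loaded."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of URLs
        self.pages = {}                    # URL -> raw page text

    def add_page(self, url, text):
        # A (hypothetical) browser plugin would call this on every page load.
        self.pages[url] = text
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(url)

    def search(self, query):
        # Return URLs containing every query term (simple AND semantics).
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        results = set(self.postings[terms[0]])
        for term in terms[1:]:
            results &= self.postings[term]
        return results

# Usage: every page you load gets indexed as a side effect of browsing.
index = LocalIndex()
index.add_page("http://example.com/a", "Iran Contra scandal and Al Qaeda")
index.add_page("http://example.com/b", "distributed crawling with Grub")
print(index.search("iran contra al qaeda"))
```

A real plugin would also need stemming, ranking, and some cap on index size, but the core data structure really is this simple.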
Something like this requires a huge adoption rate to crawl the entire web, but there are plenty of short-term benefits for anyone running the client. They could instantly search all the webpages they've already seen. This is a real problem that doesn't seem to be getting addressed. What was that article I saw? It linked the Iran-Contra scandal and Al Qaeda. Can't find it in Google. I know I saw it just a few weeks ago, but I can't remember what site it was on. A simple search of the webpages I've visited for "iran contra al qaeda" would have popped up this article from Wired.
The short term benefits don't stop there. I could easily expand my search to include all my friends, or my friends plus their friends. That would spur viral growth in client adoption. Bloggers could make their indexes available to their readers.
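Expanding a query to friends could be as simple as fanning it out to their indexes and merging the result sets. A sketch, assuming each peer exposes a search interface like the local index above; the `Peer` class and the depth-limited protocol are hypothetical:

```python
# Each peer's "index" is modeled as a plain dict: term -> set of URLs.
# The Peer class and the fan-out protocol are illustrative assumptions.

class Peer:
    def __init__(self, name, index, friends=()):
        self.name = name
        self.index = index            # term -> set of URLs
        self.friends = list(friends)

    def search(self, term):
        return self.index.get(term, set())

def federated_search(term, me, depth=1):
    """Merge results from my own index, my friends, and (depth
    permitting) friends-of-friends."""
    results = set(me.search(term))
    if depth > 0:
        for friend in me.friends:
            results |= federated_search(term, friend, depth - 1)
    return results

# Usage: a friend-of-a-friend's index is only reachable at depth=2.
carol = Peer("carol", {"jxta": {"http://example.org/jxta"}})
bob = Peer("bob", {"grub": {"http://example.org/grub"}}, friends=[carol])
alice = Peer("alice", {"gnutella": {"http://example.org/gnutella"}},
             friends=[bob])

print(federated_search("jxta", alice, depth=1))  # empty: carol is too far
print(federated_search("jxta", alice, depth=2))  # found via bob -> carol
```

The depth parameter is what turns "my friends" into "my friends plus their friends" — and it's also the knob that keeps the fan-out from flooding the whole network.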
An adaptive personalized search experience.
One big advantage of P2P is that popular information automatically gets propagated to many different nodes simply because it's popular. New memes get indexed very quickly. No more search engine lag. Query disambiguation is addressed as well, since my extended network of friends is going to interpret a query more in line with how I think than the entire web would.
Privacy would be an issue. Currently this is addressed in P2P networks by letting people choose what they want to share. People could block certain types of sites from going into their public index. I tend to think there are plenty of people that value the community gained from exposing their indexes. Witness blogs.
Oh, and assuming it's open source, there's no company involved. No centralized servers. Impossible to sue. No way to censor or shut it down.
Fundamentally, weblogs are information managers, so isn't this a logical next step for blogging?
A search engine built by the people, for the people.
UPDATE: Grub's client isn't as inefficient as I thought.