How hard can it be?
Well for a corporation very hard. But fortunately my solution to this problem that seems to be vexing everyone lately doesn’t rely on a corporate sponsor, and places the future of web search in OUR hands.
My vision of search is that each website takes a small share of responsibility, indexing sites it finds relevant and/or useful and sharing its results with other search engines that request them via a standard API, much as blogs share feeds using standards such as RSS and Atom. Each webmaster can control how much weight to give each set of imported results relative to the sites it indexes itself, and how the results are merged. I have the following basic vision, though things may change as different developers get involved in ironing out the final details of the standard. Nonetheless, the basic principles of my idea are as follows:
Each search engine should accept requests over HTTP and/or HTTPS, probably with the search parameters passed in the query string. The results are then sent back in a standard XML format, which may look something like:
<results source="http://example-search-engine.com">
  <site weight="200" url="http://example-result.org" title="Example Results">
    <favicon url="http://example-result.org/favicon.ico" />
    <page weight="150" url="http://example-result.org/page1.php" title="">
      <match><title>This title contains the search terms</title></match>
      <match><p>This paragraph contains one of the search terms</p></match>
      <match><img src="someimage.jpg" alt="This image description contains one of the search terms" /></match>
    </page>
    <page weight="150" url="http://example-result.org/page2.php">
      <match><p>This paragraph contains one of the search terms</p></match>
    </page>
  </site>
  <site ...> ... </site>
</results>
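As a rough sketch of how a client might call such an API and read the response above, here is a minimal Python example. The endpoint path and the query-parameter names (`q`, `max`) are my own placeholders, since the standard itself would have to fix these:

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Build a request URL. The "/search" path and the "q"/"max"
# parameter names are assumptions, not part of any agreed standard.
base = "http://example-search-engine.com/search"
query = urllib.parse.urlencode({"q": "distributed search", "max": 20})
request_url = f"{base}?{query}"

# Parse a response in the proposed XML format (trimmed from the
# example above), pulling out the per-site and per-page weights.
sample = """<results source="http://example-search-engine.com">
  <site weight="200" url="http://example-result.org" title="Example Results">
    <page weight="150" url="http://example-result.org/page1.php">
      <match><p>This paragraph contains one of the search terms</p></match>
    </page>
  </site>
</results>"""

root = ET.fromstring(sample)
for site in root.iter("site"):
    site_weight = int(site.get("weight"))
    for page in site.iter("page"):
        page_weight = int(page.get("weight"))
        print(site.get("url"), page.get("url"), site_weight, page_weight)
```

Because the format is plain XML over HTTP, any language with an XML parser could consume it just as easily.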
With the API open, different CMSs can be developed that both index selected sites and process results from other search engines implementing the API. Each CMS can process the results in its own way, and offer the webmaster its own set of configuration options. For example, it might integrate imported results, show them separately in a side panel, or use some mixture of the two depending on source. It can decide whether to show multiple results for one site or group them. It might be configured to show the country of each result's web server, to warn when a result appears on a malware list, or any number of other options. It might offer image search, map results and so on, linking to whatever service it chooses for these specialist searches.
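One way a CMS might merge imported results with its own, as described above: give each source a webmaster-configured multiplier and re-rank pages by the multiplier times the weight reported in the XML. All the names and numbers here are illustrative, not part of any standard:

```python
# Hypothetical merge step inside a CMS. Each result carries the
# weight reported by its source; the webmaster assigns a trust
# multiplier per source, and pages are ranked by the product.
local_results = [
    {"url": "http://local-index.example/page", "weight": 180},
]
imported_results = [
    {"url": "http://example-result.org/page1.php", "weight": 150},
    {"url": "http://example-result.org/page2.php", "weight": 150},
]

# Webmaster configuration: trust our own index fully,
# weight the imported engine at 60%.
source_multiplier = {"local": 1.0, "imported": 0.6}

merged = []
for result in local_results:
    merged.append((result["weight"] * source_multiplier["local"], result["url"]))
for result in imported_results:
    merged.append((result["weight"] * source_multiplier["imported"], result["url"]))

merged.sort(reverse=True)  # highest combined score first
for score, url in merged:
    print(score, url)
```

The same scores could just as well drive a side-panel layout instead of a single merged list; that choice stays with each CMS.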
A basic CMS may involve no login at all, with configuration handled via a file, whilst others may implement the API as a plugin for existing CMSs such as WordPress, Elgg, MediaWiki or whatever. Some may allow each logged-in user to configure their own preferences and weightings, suggest relevant sites to index, report sites as spam, and so on.
More complex dedicated search systems may themselves allow plugins or offer advanced options. These might include scraping results from existing search engines such as Google, Bing and Yahoo, in the manner that <a href="https://scroogle">Scroogle</a> currently does with Google results. Again, webmasters could apply weightings to these scraped results.
A distributed solution to search means we don't need massive datacenters under our control in order to compete with Google; each of us can take responsibility for indexing the portions of the web important to us, and share that data with each other freely for the benefit of all. Nor does making a distributed system work require abandoning the huge stores of data provided by Google until we are ready: in the meantime we can scrape from them. Ultimately, though, a distributed system based on qualified trust should prove better at promoting sites that are useful over sites with good SEO but useless content.
Google is becoming less and less useful, as each search tends to produce a hundred sites all carrying the same article scraped from Wikipedia, interspersed with ad farms. I can't help thinking such sites wouldn't rise to prominence in the distributed model, as it won't be in any webmaster's interest to index useless sites. Google, by contrast, actually profits from many of the ad farms that host its advertisements. Yet I fail to see why Microsoft or Yahoo would prove any better, as both also suffer from corrupt corporate agendas. Rather than looking for another closed, top-down corporate 'saviour' to rescue us from the latest tyrant, I believe a bottom-up, emergent approach can prove not only successful, but can become the dominant means of doing search.
Doesn’t this exist already?
Often when I come up with a good idea, I'm not surprised to find that something like it already exists. However, I can't find anything quite like this idea at present, although some similar peer-to-peer systems exist, such as FAROO and YaCy. I don't see P2P as quite the solution, as it requires each user to install software on their own computer. My proposed web-based model lacks this requirement, which gives it quite a few advantages, without precluding programs users can install that integrate searches from various sites. The proposed model also resists the injection of malicious links: webmasters who index such sites are unlikely to be trusted by others, and users will likely abandon engines that fail to exclude them.