Dragons in the Algorithm
Adventures in Programming
by Michael Chermside

Constant Crawl Design - Part 2

Suppose you were building a tool to anonymously capture the (public) websites that a user visited. What would the UI requirements be?

The basic experience would be a perfectly normal browsing experience: users would launch their favorite web browser normally, would browse around the web normally and everything would "just work". This means that clearly the system would function as a browser plug-in. Fortunately, nearly all modern browsers support some form of plug-ins. In principle, one could also develop this as a proxy, but it would be much more difficult to develop an effective UI.

An easy-to-use installation process is important if one is seeking a large user base. This means using the standard mechanism for the platform (an installer for Windows; RPM, yum, etc. for Linux). It means that the installer sets up the browser plugin, allocates the disk space needed for storage, and creates the services needed to join the P2P network.

The plugin itself offers two basic pieces of functionality. One would be that it captures the content of the web as it is viewed, and (where appropriate) archives it for the crawl. The other is the benefit for the user of the plugin: it allows them to view content from the crawled archive when the normal site is slow or unavailable. (For instance, the "slashdot effect" where a smaller site is featured on a popular news site like Slashdot or Reddit and becomes overwhelmed.)

The plugin should have three basic "modes". Any well-behaved plugin should provide an easy way to turn it off, so one mode is "disabled". Another mode would be for loading all content (or perhaps just re-loading the current page) from the archive. And of course there would be the normal mode (more on this in a moment). The mode affects the page rather drastically (it changes where we are getting it from or whether we are potentially sharing it with the world) so the plugin should probably provide an indicator of some sort in or near the URL bar, and this indicator might as well provide the means for switching modes too.
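The three modes can be sketched as a small piece of state. This is only an illustration in Python (a real plugin would be written against a browser's extension API); the names Mode, may_capture, content_source and next_mode are all made up for this sketch, and the rule that only normal mode captures content is my assumption about how the modes would behave.

```python
from enum import Enum


class Mode(Enum):
    """The three plugin modes (names are hypothetical)."""
    DISABLED = "disabled"  # plugin does nothing at all
    NORMAL = "normal"      # browse the live web; public pages may be shared
    ARCHIVE = "archive"    # load content from the crawled archive


def may_capture(mode: Mode) -> bool:
    """Assumption: only normal mode captures content for possible sharing."""
    return mode is Mode.NORMAL


def content_source(mode: Mode) -> str:
    """Where a page is fetched from in the given mode."""
    return "archive" if mode is Mode.ARCHIVE else "web"


def next_mode(mode: Mode) -> Mode:
    """Cycle through the modes, as a click on the URL-bar indicator might."""
    order = [Mode.NORMAL, Mode.ARCHIVE, Mode.DISABLED]
    return order[(order.index(mode) + 1) % len(order)]
```

Keeping the mode in one place like this makes it easy for the indicator near the URL bar to both display the current mode and switch it.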

What behavior would we want in "normal mode"? Pages that get viewed are eligible for sharing, but only if that page is determined to be a "public" page (see other essays for details on this). So the plugin would need to capture the content of the page and immediately after rendering it (perhaps in a separate thread) begin to process it for possible sharing. I've used the term "page", but essentially all content should be treated this way, including images, CSS and JavaScript files, even AJAX calls: any content downloaded by the browser.
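As a rough sketch of that flow, the following Python hands every downloaded resource to a background thread, which keeps it for sharing only if it is judged public. Everything here is hypothetical: Resource, CaptureWorker and on_downloaded are invented names, and is_public is just a placeholder for the real decision, which the other essays cover.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Resource:
    """Any content the browser downloaded: a page, image, script, AJAX reply."""
    url: str
    body: bytes


class CaptureWorker:
    """Processes captured resources off the rendering path (hypothetical)."""

    def __init__(self, is_public: Callable[[Resource], bool]) -> None:
        # is_public is a stand-in for the real "public page" test.
        self._is_public = is_public
        self._pending = queue.Queue()
        self.shared: List[Resource] = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def on_downloaded(self, resource: Resource) -> None:
        """Called after rendering; returns at once so browsing is not slowed."""
        self._pending.put(resource)

    def close(self) -> None:
        """Finish the queued work and stop the background thread."""
        self._pending.put(None)
        self._thread.join()

    def _run(self) -> None:
        while True:
            resource = self._pending.get()
            if resource is None:  # sentinel queued by close()
                return
            if self._is_public(resource):
                self.shared.append(resource)
```

The queue is what keeps the sharing work from delaying the page: the browser only pays for a put() call, and the public/private decision happens later on the worker thread.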

The next question is when content should be downloaded from the archive. Unlike Google Web Accelerator, I think it is unlikely that this design involving anonymous P2P technology will ever be faster than ordinary browsing to a normally functioning web site. But it can be available in those cases where the ordinary site no longer is, where the HTTP request times out, or a 404 (page not found), 410 (page gone), 503 (server overloaded) or some other error is returned. The simplest solution would be to attempt to load the page from the storage network whenever these conditions occur. Always trying to load every page from both web and storage (and displaying the one that arrives first) would put far too much load on the storage network for pages that would never be viewed.
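The "simplest solution" above might look something like the sketch below. The function names and the fetch_web / fetch_archive callables are hypothetical stand-ins for the browser and the storage network, a status of None is my convention for a timed-out request, and treating every status of 400 or above as an error is an assumption covering the "some other error" case.

```python
from typing import Callable, Optional, Tuple

# (status code, body); a status of None stands for a request that timed out.
WebResult = Tuple[Optional[int], bytes]


def should_try_archive(status: Optional[int]) -> bool:
    """True when the live site failed: a timeout or an HTTP error status."""
    if status is None:
        return True
    # Assumption: any 4xx/5xx counts, which includes 404, 410 and 503.
    return status >= 400


def load(
    url: str,
    fetch_web: Callable[[str], WebResult],
    fetch_archive: Callable[[str], Optional[bytes]],
) -> Tuple[str, bytes]:
    """Try the web first; ask the storage network only if the web fails."""
    status, body = fetch_web(url)
    if not should_try_archive(status):
        return "web", body
    archived = fetch_archive(url)
    if archived is not None:
        return "archive", archived
    # Nothing in the archive either: fall back to showing the original error.
    return "web", body
```

Because fetch_archive is called only after a failure, pages that load normally place no load at all on the storage network.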

It is worth noting that this storage network can store multiple versions of each page. This could be because the page has changed over time, because it is served up differently to different classes of user (perhaps by geographic region), or for stranger reasons like a malicious user injecting false versions of a page into the network. This property might lead to some very interesting and powerful uses ("See old versions of any page!") and it might pose new technical challenges ("Allow trusted entities like the EFF to somehow flag which version of the page is 'real'"). Such questions are an excellent subject for future consideration, but the next essay will address technical challenges with the storage and the final part will show how to anonymously determine whether to share a page.
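To make the multiple-versions property concrete, here is a toy in-memory model of it. It says nothing about how the real storage network would work; Version and VersionStore are hypothetical names, and identifying a version by the SHA-256 hash of its content is purely my assumption for the sketch.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Dict, List


@dataclass(frozen=True)
class Version:
    digest: str         # SHA-256 of the content (an assumed way to tell versions apart)
    captured_at: float  # when this version was first seen, as a Unix timestamp
    body: bytes


class VersionStore:
    """Keeps every distinct version seen for each URL (toy, in-memory)."""

    def __init__(self) -> None:
        self._by_url: DefaultDict[str, Dict[str, Version]] = defaultdict(dict)

    def add(self, url: str, body: bytes, captured_at: float) -> Version:
        """Record a capture; identical content is stored only once."""
        digest = hashlib.sha256(body).hexdigest()
        versions = self._by_url[url]
        if digest not in versions:
            versions[digest] = Version(digest, captured_at, body)
        return versions[digest]

    def versions(self, url: str) -> List[Version]:
        """All known versions of a URL, oldest first."""
        found = self._by_url.get(url, {})
        return sorted(found.values(), key=lambda v: v.captured_at)
```

Note that nothing in this model says which version is the "real" one; a changed page, a regional variant and a maliciously injected copy all look alike, which is exactly the challenge mentioned above.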

Posted Mon 05 March 2012 by mcherm in Programming