Dragons in the Algorithm
Adventures in Programming
by Michael Chermside

Constant Crawl Design - Part 1

Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google's servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more reliably; the advantage to Google was that they got to crawl the web "as the user sees it" instead of just what Googlebot gets... and that they got to see every single page you viewed, thus feeding even more into the giant maw of information that is Google.

Well, Google eventually dropped Google Web Accelerator (I wonder why?), but the idea is interesting. Suppose you wanted to build a similar tool that would capture the web viewing experience of thousands of users (or more). For users, it could provide a reliable source for sites that go down or that get hit with the "slashdot" effect. For the Internet Archive or a smaller search engine like Duck Duck Go, it would provide a means of performing a massive web crawl. For someone like the EFF or human-rights groups, it would provide a way to monitor whether some users (such as those in China) are being "secretly" served different content. But unlike Google Web Accelerator, a community-driven project would have to solve one very hard problem: how to do all this while keeping the user's browsing history secret -- the exact opposite of what Google's project did.

This topic came up at a meeting of the Philly Startup Hackers group, and after an entire evening of vigorous discussion, we think that such a project would be technically feasible. In this series of essays I will attempt to outline the technical architecture of this solution. This first one will explain the major components and how they fit together.

Broadly, I'll describe three different problems to be solved, and treat the solution to each one as a layer in the architecture. Problem (A) is the user interface, problem (B) is deciding what information is public (surprisingly, this turns out to be the most difficult part), and problem (C) is storing the pages.
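To make the division concrete, here is a minimal sketch of the three layers as TypeScript interfaces. All of the names are my own and purely illustrative -- they just pin down which layer is responsible for what.

```typescript
// (A) The browser-facing layer: observes pages as the user views them.
interface CaptureLayer {
  onPageViewed(url: string, html: string): void;
}

// (B) The privacy layer: decides whether a viewed page may be shared at all.
interface PrivacyFilter {
  isShareable(url: string, html: string): Promise<boolean>;
}

// (C) The storage layer: stores shareable pages and retrieves them later.
interface StorageLayer {
  store(url: string, html: string): Promise<void>;
  retrieve(url: string): Promise<string | null>;
}
```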

The solution to problem (A) (user interface) is quite straightforward. That is not to minimize it: implementing the user interface well is by far the most work of the whole project and the piece most likely to determine its success or failure. But the approach to take is clear. The tool should integrate into the user's browser, and with modern browsers that means implementing it as a browser plug-in. Equally important to the user experience is installation: to be successful, this needs to be extremely easy to install and very simple to configure (preferably with no configuration required for use).
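As a rough illustration only (the essay doesn't commit to a particular browser or extension API), the capture side of such a plug-in might be little more than a WebExtension-style content script that hands each viewed page to the rest of the extension for filtering:

```typescript
// Hypothetical content script for the plug-in. The chrome.* API is assumed
// here purely for illustration; any modern extension framework would do.
declare const chrome: any; // provided by the browser at runtime

window.addEventListener("load", () => {
  const pageUrl = window.location.href;
  const pageHtml = document.documentElement.outerHTML;

  // Hand the page off to the background script, which applies the
  // public/private decision (problem B) before anything leaves the machine.
  chrome.runtime.sendMessage({ type: "page-viewed", url: pageUrl, html: pageHtml });
});
```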

Problem (B) is to decide which pages should be public and which are private to the user. The UI can help here if users can easily flip into modes where everything is captured or where nothing is. But one cannot expect the user to click something before (or after) every page -- there also needs to be a "normal" browsing mode. We'd like to (anonymously) record the majority of pages visited (after all, that's the point of the tool), but a page showing your bank account probably shouldn't be shared, nor should a Google Docs essay you've been writing. Assuming that everything viewed over HTTPS is private and everything viewed over HTTP is public seems much too simple a rule, particularly as privacy-sensitive sites are beginning to default to HTTPS for all users.

So the approach I am proposing is that we assume that if several people see the exact same page then it must be a public one. My bank's logged-in view of my accounts won't be seen by any other users, while the Google Doc essay I share with a friend will only be seen by a couple of people. If we set the threshold to something like 6 or 9 users then we can be fairly confident that the content was public. To capture rarely-seen sites we'd want the count to accumulate over time: 6-9 users within the same month, perhaps. Now the technical challenge is to figure out how to tell whether several people have seen the same content without revealing it (since it might be private) and without leaving any trace that we have viewed it (for privacy reasons).
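Here is a sketch of that threshold rule in TypeScript, deliberately glossing over the hard part -- maintaining the count without revealing what was viewed -- which part 4 will dig into. The SightingCounter service, the hash choice, and the exact numbers are all assumptions made just for illustration.

```typescript
import { createHash } from "crypto"; // Node's crypto, used only to make the hashing concrete

const SHARE_THRESHOLD = 6; // roughly the "6 or 9 users" figure discussed above

// Identify a page by a hash of its content, never by the content itself.
function contentKey(pageHtml: string): string {
  return createHash("sha256").update(pageHtml).digest("hex");
}

// Hypothetical service that tracks how many distinct users have reported a
// given key within the last month or so; the real version must do this
// anonymously, which is the open problem deferred to part 4.
interface SightingCounter {
  report(key: string): Promise<number>; // returns the current distinct-user count
}

// Only treat a page as public once enough users have seen identical content.
async function maybeShare(pageHtml: string, counter: SightingCounter): Promise<boolean> {
  const seenBy = await counter.report(contentKey(pageHtml));
  return seenBy >= SHARE_THRESHOLD;
}
```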

Problem (C) is storing the content. Spotify is a popular music player that has been installed by millions of users, yet the company doesn't need huge servers to transmit all those data streams. Instead, it uses P2P technology: each user provides a certain amount of storage and a certain amount of bandwidth. Other projects like Freenet have proven that P2P sharing can store data while keeping all the participants anonymous. So I propose leveraging fairly standard P2P approaches (or better yet, an existing P2P storage network) for storing, finding, and retrieving the content.
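From the rest of the system's point of view, that storage layer might reduce to content-addressed put/get operations over some P2P network. The Dht interface below is hypothetical -- in practice it would wrap an existing P2P storage network rather than anything home-grown.

```typescript
import { createHash } from "crypto";

// Hypothetical handle onto a P2P distributed hash table.
interface Dht {
  put(key: string, value: Buffer): Promise<void>;
  get(key: string): Promise<Buffer | null>;
}

// Store a page under the hash of its content, so that any node can verify
// that what it retrieved is exactly what was originally stored.
async function storePage(dht: Dht, pageHtml: string): Promise<string> {
  const key = createHash("sha256").update(pageHtml).digest("hex");
  await dht.put(key, Buffer.from(pageHtml, "utf8"));
  return key; // the key doubles as the page's address in the network
}

async function fetchPage(dht: Dht, key: string): Promise<string | null> {
  const value = await dht.get(key);
  return value === null ? null : value.toString("utf8");
}
```

One convenient property of content addressing is that the same hash used for the threshold counting above could also serve as the storage key.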

Well, that's enough for one essay. Check back in part 2 for more details about the UI, part 3 for a discussion of the storage network, and part 4 for analysis of how to anonymously determine whether something can be shared.

Posted Sun 04 March 2012 by mcherm in Programming