Constant Crawl Design - Part 3
Suppose you were building a tool, integrated with web browsers, to anonymously capture the (public) websites a user visits and store them in a P2P network shared by the tool's users. What would the requirements be for this storage P2P network?
There are many different types of P2P networks for storing and retrieving files. Different networks and protocols have been designed for different purposes and thus have different strengths and weaknesses. For instance, Freenet is designed to store and retrieve files with a focus on extremely strong anonymity guarantees and resistance to files being deleted, but with weaknesses such as being particularly slow. The Gnutella network is decentralized and efficient, but its usage is not anonymized. Rather than fully specifying the design of a storage network, this essay will just attempt to list the requirements. Re-using an existing protocol (or even an existing network) might be the best approach; next best would be to design one by combining well-tested components from existing successful networks.
The P2P storage network would need to support the following functions:
1. Storing values by key with some metadata. The key would be fixed and known (the URL of the content). The metadata would be small -- things like the date the content was viewed and the hash of the content. The value would be the content of the URL and would not be small -- it would be the size of a web page, image, PDF, or whatever was being stored. (Content larger than a certain size could be rejected.)
2. Storing multiple values by the same key. A page may change or may be viewed differently by different users. The storage network needs to support storing these multiple values.
3. Querying the metadata of a certain key. In particular, we need to query whether an entry already exists for a given URL (key) with a particular value for the hash of the content. To prevent bad actors from responding "yes" when the content is really absent, the response may need to include the value itself. We would expect this query to be by FAR the most common one performed -- hundreds of times more frequent than any other function the storage network would perform.
4. Retrieving the value (and metadata) for a given key. If multiple values were stored we would either return all of them or select one by metadata.
5. Querying to find out which keys had new values stored within some recent time period. This would not be used by normal users, but by organizations like the Internet Archive or DuckDuckGo that actually wanted to download the contents of this storage, thus making it into a means for crawling the entire web. It would be acceptable if the only way to perform this query were to contribute significant resources to the P2P network, and also if the results were likely, rather than certain, to be accurate.
Those are the only functions that would be required, but there are some broad characteristics of the P2P network that are also essential:
A. Anonymity: it should be quite difficult to determine who is performing any of functions 1 through 4. (It is OK if function 5 is not anonymous.) Otherwise, the network could be used to reconstruct a user's browsing history. This could be accomplished via onion routing, the elaborate approach used by Freenet, or any other solution that makes it impossible for any participant in the network to tell for sure what other members of the network are searching for.
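A minimal sketch of why onion routing hides the requester: each relay can peel exactly one layer and learns only the next hop, never the origin or the final query. Here base64 merely stands in for encryption (real systems encrypt each layer to the corresponding relay's public key), and the relay names are illustrative.

```python
import base64
import json

def wrap(query: str, route: list[str]) -> str:
    """Wrap a query in one layer per relay, innermost layer first."""
    layer = query
    for relay in reversed(route):
        layer = base64.b64encode(
            json.dumps({"next": relay, "payload": layer}).encode()).decode()
    return layer

def peel(layer: str) -> tuple[str, str]:
    """What one relay does: remove a single layer, revealing only the next hop."""
    inner = json.loads(base64.b64decode(layer))
    return inner["next"], inner["payload"]
```

Only the last relay ever sees the query, and it has no idea which peer originated it.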
B. Legal Protection for Stored Content. Legal liability for storing content on a P2P network can be a genuine problem. But in essentially all jurisdictions this can be avoided if the individuals participating in the network are not able to determine just what content they hold. For instance, some systems break each file into chunks which are encrypted so they cannot be understood without possessing all the chunks, then store different chunks in different locations. No one peer holds any complete file.
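A sketch of that chunking idea, under stated assumptions: the file is encrypted with a key derived from its own hash (convergent encryption, in the spirit of Freenet's content-hash keys) and the ciphertext is split into chunks, so no single chunk holder can read or even recognize the content. The SHA-256 counter-mode keystream below is a toy stand-in for a real cipher such as AES-CTR.

```python
import hashlib

CHUNK = 4096  # assumed chunk size

def _keystream(key: bytes, length: int) -> bytes:
    """Toy stream cipher: SHA-256 of (key || counter), concatenated."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def split(content: bytes) -> tuple[bytes, list[bytes]]:
    key = hashlib.sha256(content).digest()  # convergent key: derived from content
    cipher = bytes(a ^ b for a, b in zip(content, _keystream(key, len(content))))
    return key, [cipher[i:i + CHUNK] for i in range(0, len(cipher), CHUNK)]

def join(key: bytes, chunks: list[bytes]) -> bytes:
    cipher = b"".join(chunks)
    return bytes(a ^ b for a, b in zip(cipher, _keystream(key, len(cipher))))
```

Because the key is derived from the content, anyone who already has the page can reproduce the key and verify what is stored, while peers holding stray chunks cannot.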
C. Reliability. The P2P system must retain its data even as peers join and drop out with some regularity. This means keeping redundant copies of everything. It must also function if some minority of the peers are antagonists who will abuse the protocol in an attempt to harm the network. These are standard features of most P2P networks.
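One standard way to place the redundant copies is rendezvous (highest-random-weight) hashing: every peer is scored against each key, and the key is stored on the k top-scoring peers, so replica placement is deterministic and shifts only minimally when peers churn. A minimal sketch, with illustrative peer names and an assumed replication factor of 3:

```python
import hashlib

def replica_peers(key: str, peers: list[str], k: int = 3) -> list[str]:
    """Return the k peers that should hold replicas of `key`."""
    scored = sorted(
        peers,
        key=lambda p: hashlib.sha256(f"{p}:{key}".encode()).hexdigest(),
        reverse=True)
    return scored[:k]
```

A useful property of this scheme: when a peer that was not holding a given key leaves, that key's replica set is unchanged, so only the departed peer's own keys need re-replicating.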
D. Purging of old values. The size of the P2P storage network cannot simply grow forever, so values need to be evicted eventually. Evicting values older than a certain age should work fairly well: content that is still being viewed frequently will quickly be re-stored, and retaining values for a fixed period gives those using the network to crawl the web an opportunity to download the content. Also acceptable, but perhaps less ideal, would be automatically evicting the least frequently requested values.
That's about it for the requirements for the storage network, except that given these constraints we would like it to be as fast as possible, especially for frequently requested values. This implies that the number of network hops should be minimized (though anonymity will still require a fairly large number) and may favor solutions that store frequently requested information in multiple locations.
The next (and final) installment of this series will discuss the interesting (and challenging) question of how to tell when a viewed page can be made public.