Constant Crawl Design - Part 4
Suppose you wanted to build a tool for anonymously capturing the websites that a user visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was "public" and to do so without keeping any record (not even on the user's own machine) of the sites visited or even tying together the different sites by one ID (even an anonymous one).In this essay I will propose a possible solution to this.
The first issue is to decide your definition of a "public" site. Simple solutions like "anything over SSL is private" fail in both directions: some content which is not private gets served up over SSL (generally a good thing), and some content which should be private gets served up over unsecured connections (like Facebook pages). I propose the following rule instead: any page which is seen with exactly the same content by at least N different individuals will be considered public. (The value of N can be debated, but something like 6 or 9 could be a starting place.) This means that any pages that personalize to each user (even just a "Hello Janet Smith" at the top) will not get shared -- that may well be exactly the behavior that is desired. And it means that any page that IS shared must at least be known to a reasonably large group (so it's not THAT secret). This definition works reasonably well for things like Google Docs (stays private unless it is widely shared) but works poorly for things like a corporate intranet (may get shared unintentionally if too many people use the tool). Perhaps a configuration in the tool could allow certain domains to be excluded from sharing; even more likely is that no company with serious security concerns will allow a tool like this to be run on their intranet anyway.
If we accept this definition for when pages should be shared, what we have left is just a technical problem (albeit a difficult one): how can we determine whether the exact content of a certain page has been seen by at least N other users without keeping a record anywhere that it has been seen by any individual user. Storing information is not difficult; we have already accepted that this project will use a P2P storage network and use some local storage so we can store information locally or inject it into a P2P network (perhaps a different network than the one used for document storage). But some obvious approaches won't work. We can't store a record of what has been seen on the local disk because that would record the entire browser history. We can't store it in the P2P network in a form that can be retrieved because then we have already made the page public. And we can't store it tied to a particular user-id (even an anonymous one) because tying together different pages visited is an effective way to identify an individual.
I will motivate the final solution by showing a series of partial solutions, each of which gets closer to solving the problem. As a first pass, we could use a P2P network which manages a large distributed hash table in which we can anonymously store values (multiple values for a given key) and which allows us to anonymously query the values. When a user viewed a page, they would use its URL as a key, and store a hash of the page as one value for that key, along with a counter. If the counter had already reached N then the page was public and program could go ahead and store the value in the storage network. The contents of non-public pages are not stored (locally or in the network), only their hash value. Old values would be kept in the P2P network for a good long time -- a month or so would be good, to be sure of capturing the "long tail" of rarely viewed content, but would eventually be cleared out to make space for newer content.
This fails because there may be malicious participants in the P2P network. As soon as a malicious participant saw a value being inserted with a count of 1 they could immediately insert it again N-1 times. Or perhaps they could even lie and return a count of N even though the count was actually 0. In either case, users would be tricked into storing content that had not actually been seen by N distinct users.
To protect against the malicious participants we need to clarify our threat model. At this stage (where we determine what content to make public), a malicious participant is one who is trying to force the user to reveal content that shouldn't be public. Since public is defined as "seen by more than N individuals", a malicious participant who already knew the content could just publicize it directly (or push it into the storage network), so it is malicious participants who do not know the content of the page that we are concerned with. That leads immediately to a solution: instead of trusting the count, we can store something that only a user who had actually seen the page could generate: a "proof" that the page has been seen.
Simply create N fixed blocks of text. (It could be as simple as "This is block <N>" -- the content doesn't really matter.) To obtain proof number n, concatenate the page text with block n then take the hash of it. A user will query the P2P network for all hash values stored for a certain URL. Some of them may match the hash codes for proofs 1 through N. If all of those hashes are present, then N other participants have seen this value and it should be pushed into the storage network; if not then the lowest-numbered missing value should be added. A malicious participant, not having the content, cannot generate any of these values, and they cannot determine anything from the hash values themselves (since a hash cannot be inverted).
There is, unfortunately, one problem remaining. If a single user views the same content N times, they might push all N proofs into the network all by themselves. The values cannot be tagged with an ID of the individual who found it because that would allow multiple page views to be tied together, leaking identity information. And they cannot keep a record of which sites they have visited as that too would reveal information. The solution I propose solves this problem but replaces the certainty that there were N viewers with a probabilistic solution.
When the application is first installed it will select a random "user id". This is just a string which is used to identify the installation so we can be sure not to double-count it. There needs to be a way for the user to read the "user id" and to modify it. That way someone who frequently uses several different computers can set the user id to the same value on all of them and will not double-count views just by moving between home, work, and mobile locations. Of course, it is important that we do not link this user id to the view history.
Now, instead of creating N different blocks to concatenate with the page contents we will create K different blocks (where K is somewhat less than N). For instance, K=8 is a reasonable value; certainly we require that K is much smaller than the number of users of the crawling system. As before, query to see which of the K proofs are present in the P2P system. The system takes a hash of the user id, mods by K to select a random (but consistent) value, k (where k is in [0..K-1]) essentially dividing all users up into K groups. If the proof for k is present, do nothing: it is possible that this proof was added by this same user. If the proof for k is missing, then add it. And if the total number of proofs is more than some threshold n, then enough people have seen the page and it can be published into the storage network.
What should n be? Well, malicious participants cannot affect anything since they don't have the page contents and can't generate the proofs. Each actual viewer of the page has what is essentially a randomly chosen value for k. So as people view the page, they are selecting a random value of k (from 0 to K-1). This continues until n distinct values have been chosen (some may have been chosen multiple times). We observe the count of distinct values seen so far (call it m) and want to estimate the number of actual page viewers.
I'll skip over the math and jump right to the answer: the expected number of actual viewers is:
FIXME: HERE I NEED TO INSERT SOME TEXT THAT I MANGLED WHEN CONVERTING THE BLOG.
So if we use 8 for K and we see 4 entries, that means there must have been at least 4 users seeing the same content and there was probably about 6. If we see 5 entries then there must have been 5 different individual seeing the same content and there were probably 306/35 or about 8.7. The values required can be adjusted to reach the desired value for expected number of individuals seeing the page, while keeping K small (if it is large enough that the number of participants in on bucket is small then it can potentially be used to help determine your browser history).
Well, that's enough essays on this topic for now. If you want to learn more, you may want to check out https://github.com/titanous/constantcrawl where there may eventually be an effort to build this.