I am building a system to store records about businesses (within the US). As with most systems these days, it will be highly distributed: it runs in three different data centers, each running multiple machines each of which runs numerous threads. One of the things that this distributed system needs to do is to assign IDs to newly-created records. These IDs need to be unique; this is the story of the available options and what I designed as a solution.
It is always important to make sure we are clear about our requirements, and the specifics of them will control what is the best solution. In this case, here are the requirements we need to deal with:
- No Duplicate Ids - we need to ensure that no two businesses are issued the same ID.
- Avoid Cross-System Duplicates - we also have a system which issues IDs for individuals, and we want to ensure that the same ID is never used for both an individual and a business. On top of that, the exact algorithm for assigning IDs to individuals has not been determined yet. But clearly something like small integers for the IDs would run a risk of duplication.
- Highly Distributed - we need this to run in multiple threads on multiple machines in multiple data centers.
- Highly Available - the issuing of new unique IDs needs to continue to function even in the event of an outage of any individual system, a network partition, or any similar failure.
- Short IDs Not Required - some use cases will require that IDs be short enough to be passed around and processed efficiently -- "fits in a 32-bit integer" or something similar. But in our case we have no such requirement. We are thinking that "a string of up to 36 characters" is a good constraint (long enough to allow for the standard ASCII format for UUIDs including punctuation).
- Human Readable - a fairly low-priority requirement is that the IDs be recognizable and readable by humans. So, for instance, no binary blobs, and "l|11!|1l!" is probably a really poor choice for an ID. The main reason why this is nice-to-have is that we expect to use these IDs as the standard way to refer to a business across many different systems for many years to come. Although the IDs are not intended for display to customers or even internal users, there will DB admins, data analysts and programmers who will encounter the values in the wild. Making their lives easier would be kind.
- Performance Not Required - some uses cases will require that new IDs be issued extremely rapidly. But in our use case, we only need to issue IDs when we acquire a new business as a customer which is a rare operation and one that has other significant delays. Even a slow solution that took hundreds of milliseconds to issue each ID would be acceptable.
- Unpredictable IDs Not Required - some use cases will require that IDs not be predictable. A good example would be a user-visible customer ID that was used as part of an HTTP request to retrieve data. If such IDs are simply generated by a sequence and there were no other security around accessing that data, then a malicious user could retrieve customer information for other customers by simply putting in larger and smaller numbers as IDs. For our case this is not a concern because we do NOT intend to allow access to any data based solely on knowledge of the business ID. It might be nice if the IDs were unpredictable, but it isn't required.
Four Different Approaches
I know of four different approaches for implementing unique IDs in a distributed system.
Approach 1: Centralized Server Issues IDs
If all requests for an ID connect to a central system of some sort then it becomes easy to implement unique IDs. The simplest approach is a counter, more complex examples include searching a list of previously-issued IDs (the list itself becomes the central system).
A good example of this would by any system that uses a database sequence to assign IDs.
The primary disadvantage of this approach is that it requires all calls to access a single, centralized system. This can lead to availability issues when that central sysem is down or unreachable.
Approach 2: Issue Batches of IDs
A variant on approach 1, this has a centralized system issue batches of IDs to distributed components, and allows the components to issue IDs from that batch. The nature of the solution changes with batch size: if issued in batches of 20, this is little more than a small cache augmenting approach 1. At the other end of the spectrum, one could issue batches of ten million and each distributed server could run for years before needing to have a new batch issued.
For small batch sizes this has the same disadvantages as approach 1. For extremely large batch sizes it can run into problems where the process for issuing new batches has been forgotten and the team moved on years ago when a batch suddenly runs out (I have seen this happen in a production system). The issuing of batches can be particularly problematic in a situation where new components are added frequently (as happens in scalable cloud deployments). Imagine a data center loses connection and the other data centers spin up new machines to handle the load, but the new machines need to obtain a batch of IDs to issue before they can begin giving out new business IDs, and the central server that gives out batches was in the data center that is now inaccessable. The problem certainly can be solved, but it has some complexity.
Approach 3: Procedural Guarantee of Uniqueness
In this approach, each component is authorized to issue IDs, but we implement a procedure for how the IDs are generated that guarantees that no two components will ever issue the same ID. For instance, we might take the address of the machine's ethernet card (every ethernet card is guaranteed by the manufacturer to have a unique address), combined with the time represented as a count of tenths-of-millisecond time windows, plus a counter in case two IDs are issued on the same machine in the same tenth of a millisecond.
The best-known example of approach 3 is variant 1 UUIDs, a standard originally created by Microsoft, which uses the approach described above.
One disadvantage of approach 3 is that the IDs may need to be rather long. UUIDs, for example, use 128 bits which is usually expressed in readable format using 36 characters (hexadecimal digits and punctuation). Another issue is that the rules that prohibit duplication may be subtle and complex. For instance, when the UUID specification was written, "machines" actually had physical ethernet cards. But today the "machines" that we run on may actually be virtual machines in the cloud that might share a common ethernet card. Probably, this can be addressed with sufficient care (perhaps the virtual machines still have a unique identifier we can use?), but it is difficult to be certain that all situations have been considered.
Approach 4: Assign Random IDs
Finally, the probabilistic approach: just assign a sufficiently-long random value as the unique ID. If the range of random values is sufficiently large compared to the number of IDs that will be used, then the probability of getting a collision can be made as small as we like.
The best-known example of approach 4 is ALSO UUIDs, specifically variant 4 as described in the UUID spec.
Like approach 3, this approach also requires IDs that are rather long. It requires a reliable source of random numbers (but that is a resource available on all modern operating systems). It is only probabilistically unlikely result in a duplicate ID rather than deterministically impossible. And the most notable disadvantage of this approach is that it is difficult to explain to people why it reliably avoids collisions (but I will attempt to make the case below).
So, what approach do we want to go with for our purposes? Well, the disadvantages of approach 1 (centralization) directly conflict with the requirement that our system be distributed and available. So this is our least preferred option.
Approach 2 is much more feasable. We could issue moderate-sized blocks of IDs. A central server could give out blocks to newly-started components, but we would need a fallback for cases when the central server is unreachable. Perhaps issue multiple blocks to each component and a new component can contact a sibling to be given one of its spare blocks when the central server is unavailable. The problem is that this solution is now rather complicated. It is a reasonable fallback if we cannot find a more palitable approach but we would prefer something less complex.
Approach 3 suffers from much the same problem. Rather than invent our own approach, we would most likely choose to use UUID Variant 1, but with additional attention paid to making sure that our implementation continued to guarantee uniqueness even in a cloud deployment. Java provides general UUID support, but the standard library does not provide a variant 1 generator (some libraries do provide this). All told, this seems much more risky in that a poor implementation would result in duplicate IDs. Let's avoid this.
Finally, that leaves us with approach 4 to consider. This suffers from long IDs, but in our use case that is acceptable. Implementation is simple and straightforward, given a source of reliable random numbers (like java's SecureRandom). The only concern is whether the probability of collisions can be made small enough (and also whether it can be clearly explained to others). So in the next sections we will make a specific proposal based on approach 4, then analyze it.
A Specific Proposal
Each business id will consist of the letters "BIZ" followed by 20 hexadecimal digits (0123456789ABCDEF) of cryptographically secure randomly generated values (80 bits of randomness). For instance, "BIZ5FF29656A0DEE9E62B21" would be an example of a business id.
The prefix "BIZ" is used for two reasons. First, it goes a long way toward guaranteeing that no business id will overlap with a person id. Although the rules for generating a person id are not defined, it seems likely that they could be designed to not start with the characters "BIZ". If UUIDs or GUIDs are used then "I" and "Z" will not be valid characters so there certainly will not be any overlap.
The second advantage of beginning with "BIZ" is that it helps with readability. Although the business ids are not intended to be visible to end users, there are DBAs, data analysts, and programmers who will occasionally see the raw values and being able to quickly recognize that this value is probably a business id will be convenient.
Finally, 80 bits worth of randomness was chosen in order to obtain the desired extraordinarily low probability of collisions (see next section). The random bits could have been expressed in a more compressed form (perhaps base 64) instead of hex, but the increased compatibility with a large number of different systems (by having only digits and case-insensitive letters) seemed more valuable than minimizing ID length (which was explicitly NOT a requirement).
Analyzing the Risk
"Sure," one might say, "80 bits of randomness sounds like a lot. But it is random, so you can never be sure. There is still a small chance issuing duplicate IDs. How can that be acceptable?" Let us do an analysis to confirm.
We can use this formula (from Wikipedia) to calculate the number of IDs that will need to be issued before we have a 50% chance of getting a collision:
Pluggin in our 80 bits of randomness, we find it requires:
Now, there are roughly 30 million businesses in the US. Assume that we sign up every one of them as a customer in the first year. Then in the second year, we find another 30 million businesses. If we keep adding these, we will have a 50% chance of seeing a collision after only:
To put this in perspective, Google analyzed the rate of errors simply reading values from memory (where, due to a cosmic ray quantum fluxuation or a flaw in a chip, the value read out of memory doesn't match what was stored). If we simply read those same businesses in from memory we would expect to have around 3,500 errors due to physical hardware issues (and only half an error due to random collisions).
Any problem that only crops up if we acquire 43,000 times the total number of businesses and is about 7,000 less likely than the computer simply making a mistake reading from memory is not a problem that we realisitically need to be concerned about.
(And, if it helps to settle your mind, it is we can configure the database such that if we attempt to insert a colliding record it will be detected as an error and the create will fail. Of course any failing write is thousands of times more likely to be due to physical corruption within the computer than it is to be due to a random collision.)
Let's use approach 4, with business ids that look like "BIZ5FF29656A0DEE9E62B21". It does not require any centralized server, is extremely simple to implement, and is probabilistically safer than simply using a physical computer.