Jump Consistent Hash: A faster/leaner approach for scaling data storage

J. Kalyana Sundaram
Apr 28, 2022 · 4 min read

I was recently evaluating the options for data partitioning in one of my projects. In general, such problems can be described as:

How can a set of clients agree on assigning a set of objects (e.g., a key + value) across a set of servers/buckets?

How can we do this in a dynamic environment where the above entities can come and go? How do we minimize the re-assignments when a new server is added? How do we get an even distribution of keys across servers?

I came across an algorithm called “Jump Consistent Hash” in the paper “A Fast, Minimal Memory, Consistent Hash Algorithm” by John Lamping and Eric Veach at Google. In this article, I will summarize my main takeaways about it and describe how it compares to the original Consistent Hashing algorithm (as defined by Karger et al. in their 1997 paper), which I wrote about here.

Before jumping into it (no pun intended :)), let’s first recap the main tradeoffs of the original Consistent Hashing algorithm.

What are the main tradeoffs in Consistent Hashing?

There are two main tradeoffs of the original consistent hashing approach:

  1. Getting an even distribution of keys across nodes is a challenge: You need a high number of virtual nodes per server to even out the distribution. Even with 1000 virtual nodes per server, the standard deviation of the number of keys assigned to different nodes is ~3.2%, and the 99% confidence interval for the ratio of a bucket's size to the average is (0.92, 1.09).
  2. Higher memory requirements: The memory requirement is proportional to (number of servers) * (number of virtual nodes per server), because each client must keep the hash value of every virtual node, plus the related mappings, in memory. This can add up to several MBs per client, as the sketch after this list illustrates.
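To make that memory cost concrete, here is a minimal sketch of a classic hash ring with virtual nodes (the class, the use of std::hash, and the virtual-node naming scheme are illustrative choices of mine, not from any particular library):

#include <cstdint>
#include <functional>
#include <map>
#include <string>

// A classic consistent-hash ring. Every server contributes
// virtual_nodes points, all of which must be kept in memory.
class HashRing {
 public:
  explicit HashRing(int virtual_nodes) : virtual_nodes_(virtual_nodes) {}

  void AddServer(const std::string& name) {
    for (int i = 0; i < virtual_nodes_; i++) {
      // One sorted-map entry per virtual node, e.g. 100 servers
      // * 1000 virtual nodes = 100,000 entries.
      uint64_t point = std::hash<std::string>{}(name + "#" + std::to_string(i));
      ring_[point] = name;
    }
  }

  // The first point clockwise from the key's hash owns the key.
  // (Assumes at least one server has been added.)
  const std::string& ServerFor(const std::string& key) const {
    uint64_t h = std::hash<std::string>{}(key);
    auto it = ring_.lower_bound(h);
    if (it == ring_.end()) it = ring_.begin(); // wrap around the ring
    return it->second;
  }

 private:
  int virtual_nodes_;
  std::map<uint64_t, std::string> ring_; // hash point -> server name
};

Each of those 100,000 entries carries the 8-byte hash point, a copy of the server name, and the per-node overhead of the sorted map, which is how per-client memory climbs into the MBs.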

What is Jump Consistent Hash and what is the main intuition behind it?

How can we address the above limitations? What tradeoffs can we make?

Many times, when we are stuck on a problem, we can make progress by narrowing the scope of the problem.

That’s the approach used by Jump Consistent Hash. It asks questions such as: What if removing any arbitrary server doesn’t need to be supported? What if arbitrary server names don’t need to be supported? What if the clients always know the full list of servers?

The above questions are relevant in the context of data storage applications. In such scenarios, shards can be represented by numeric IDs instead of needing arbitrary names. Also, removing an arbitrary server doesn’t need to be supported. This is possible because, unlike in a distributed cache application, once data is assigned to a shard/bucket in a data storage application, that shard is responsible for it, and it cannot be removed. At the same time, server failures can still be handled using other mechanisms such as replication.

How does Jump Consistent Hash work?

To understand this algorithm, we have to understand what a seed is. In a pseudo random number generator, a seed is the starting point for a sequence. By using the same seed to initialize it, you are guaranteed to get the same sequence of random numbers every time.
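For example, here is a minimal C++ sketch (std::mt19937_64 is just one concrete choice of generator):

#include <cstdint>
#include <iostream>
#include <random>

int main() {
  uint64_t key = 123456789; // the key doubles as the seed

  std::mt19937_64 gen1(key); // two generators...
  std::mt19937_64 gen2(key); // ...seeded with the same key
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  // Both generators produce exactly the same sequence.
  for (int i = 0; i < 3; i++) {
    std::cout << dist(gen1) << " == " << dist(gen2) << "\n";
  }
}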

In this case, we want to determine the bucket for a key. What if we use the key itself as the seed of a pseudo random number generator? Then, for a given key, the sequence of random values returned by the generator is completely predictable. Using these random numbers, the algorithm jumps forward through the list of buckets; the last bucket it lands on that is still within range is the bucket assigned to the key.

Here’s the intuition of the algorithm as described in the paper:

int ch(int key, int num_buckets) {
  random.seed(key);
  int b = -1; // bucket number before the previous jump
  int j = 0;  // bucket number before the current jump
  while (j < num_buckets) {
    b = j;
    double r = random.next(); // uniform random in (0, 1)
    j = floor((b + 1) / r);
  }
  return b;
}

The algorithm takes a key (a 64-bit integer) and the number of buckets, and returns the bucket ID for the key. If your key is not already a 64-bit integer (e.g., it is a string), hash it down to one first.
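The paper also gives a complete C++ implementation in which the generic random number generator is replaced by an inlined 64-bit linear congruential generator, so each jump costs only a multiply, an add, and a few bit operations:

#include <cstdint>

int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
  int64_t b = -1, j = 0;
  while (j < num_buckets) {
    b = j;
    // Advance the inlined 64-bit linear congruential generator.
    key = key * 2862933555777941757ULL + 1;
    // Equivalent to floor((b + 1) / r), where r = ((key >> 33) + 1) / 2^31
    // plays the role of the uniform random number in (0, 1].
    j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
  }
  return b;
}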

Remember that, unlike regular consistent hashing, arbitrary bucket names or removing arbitrary buckets are not supported.

What are the benefits of it?

By adding the above constraints, Jump Consistent Hash achieves:

Even key distribution: It achieves a standard deviation of 0.00000000764 and a 99% confidence interval for the ratio of bucket size to average bucket size of (0.99999998, 1.00000002), an essentially perfect distribution. Compare this to the corresponding numbers above for regular consistent hashing with 1000 virtual nodes per server.

Minimal memory usage: Its working state is just a handful of variables that fit in a few registers; there are no per-node tables or mappings to maintain.
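As a hypothetical sanity check (the sequential key range and the bucket counts are arbitrary choices of mine), both properties can be measured directly; this sketch reuses the JumpConsistentHash function shown earlier:

#include <cstdint>
#include <cstdio>
#include <vector>

int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets); // defined above

int main() {
  const int kKeys = 1000000;
  const int kBuckets = 10;
  std::vector<int> counts(kBuckets, 0);
  int moved = 0;

  for (uint64_t key = 0; key < kKeys; key++) {
    int32_t before = JumpConsistentHash(key, kBuckets);
    int32_t after = JumpConsistentHash(key, kBuckets + 1);
    counts[before]++;
    if (before != after) moved++; // remapped by adding one bucket
  }

  for (int b = 0; b < kBuckets; b++)
    printf("bucket %d: %d keys\n", b, counts[b]);

  // Expect each bucket to hold close to kKeys / kBuckets keys, and only
  // about kKeys / (kBuckets + 1) keys to move when growing from 10 to 11.
  printf("keys moved when growing to 11 buckets: %d\n", moved);
}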

Conclusion

That was a quick introduction to this algorithm. I am looking forward to trying it out in practice. Do check out the full paper at “A Fast, Minimal Memory, Consistent Hash Algorithm”.
