# Encrypting Distributed Databases with Provable Security

^{20}/Apr 2020

As people worldwide continue to generate massive amounts of data, distributed storage systems or distributed databases have become a necessity. Much of this data is sensitive, and it’s been the subject of breaches (Facebook in 2019, Marriott in 2018) that have affected hundreds of millions of users. Encryption is often proposed as a security solution, but when misapplied, it can lead to systems that fail to provide satisfactory privacy guarantees. My work addresses this issue by identifying and formalizing the key problems of end-to-end encrypted distributed databases. We provide formal security definitions and efficient solutions that are provably secure by using techniques from cryptography, distributed systems, and networking. Provable Security

Provable security is a methodology developed by cryptographers that formally analyzes the security of a system with respect to specific security definitions and against a particular adversary. Since a system is only as secure as the definition under which it’s analyzed, the definition must be a reasonable one that models the real world precisely. Despite their advantages, distributed systems are considerably more complicated to model than non-distributed ones. Just a few of their challenges include new failure modes, the complexity of the distributed state, and the intricacies of interactions between heterogeneous components. All of these need to be correctly captured by the security definitions.

My research addresses the privacy concerns of large distributed storage systems by showing how to safely integrate end-to-end encryption in them. I work on two of the major elements of these systems:

end-to-end encryption in distributed hash tables (DHTs). DHTs have had a huge impact on systems design and are the building blocks of many distributed databases. Adding encryption to them provides a strong foundation for building a variety of secure storage systems.

end-to-end encrypted key-value stores (KVSs), the complex storage systems that are actually used in practice.

Below I explain the challenges in formalizing security in encrypted DHTs and KVSs.

## Security Of End-To-End Encrypted DHTs

We can think of a DHT as a structure that allocates data modeled as label/value pairs across the nodes of a system. In our work, we look at what corrupted nodes collectively are able to learn about a user’s data and/or queries. For example, they might learn that there are at least m pairs stored in the DHT, where m is the sum of the number of pairs held by each corrupted node. With respect to the user queries, they might learn, when the user queries for the same label again. So a-priori, it might seem that the only information they can learn is related to what they collectively hold. While this intuition might seem correct, it is not true! In fact, the corrupted nodes can infer additional information about data they do not hold. For example, if a DHT is load balanced (this means there’s a high probability that each node receives approximately the same number of pairs), the corrupted nodes can approximate the total number of pairs in the system even if they collectively hold only a small fraction of it.

Because of this, the corrupted nodes can guess with high probability that the total number of pairs is about $mn/t$, where $t$ is the number of corrupted nodes and $n$ is the total number of nodes. This may seem benign, but it’s only a simple example to demonstrate that analyzing security in distributed systems can be complicated. In fact, we show that some of the very properties that distributed systems aim for can have subtle effects on security. In particular, we isolate two properties of DHTs, balance and non-committing allocation, and show that the security of an end-to-end encrypted DHT is a function of these two properties.

## Security Of End-To-End Encrypted Key-Value Stores

Now, let’s look at the security of more complex storage systems. Key-value stores (KVSs) are also distributed databases, but unlike DHTs, they store a label/value pair on multiple nodes instead of a single one. In fact, most KVSs use DHTs as their building blocks and add replication, a feature that allows data to be retrieved from other healthy nodes when certain nodes fail. This adds resiliency to the system.

Even though replication is helpful to achieve fault tolerance, it has its own challenges and issues, the biggest of which is to ensure consistency among replicas. Given the CAP (Consistency, Availability, Partition tolerance) impossibility result, which states that consistency and availability can’t be achieved simultaneously in the presence of network partitions, many replicated storage systems settle for weaker consistency models for efficiency reasons. Even though weak consistency has been deployed, eventual consistency has become prominent, particularly among highly scalable KVSs.

Eventual consistency states that the replicas will “eventually” converge but doesn’t tell when that convergence happens, and until then, a read can output any value(s) from the set of values that its replicas have. Which value(s) it outputs depends on multiple factors, such as how the network schedules the messages, how the KVS orders its concurrent operations, how it detects inconsistencies, how often it tries to detect inconsistencies, what happens when inconsistencies are detected, and which value(s) should be returned if multiple values are read.

As an example:

Assume that replica A has value $v_1$ and B has value $v_2$. In this case, the output of a get (read) query depends on whether the network delivered the read query message to A or B. As another example, suppose that the network reads from both A and B. Then, the question is what the user should get: $v_1$ or $v_2$ or both. Interestingly, the answer depends on the application that is using the KVS. For example, if these values represent the items in an Amazon shopping cart, it only makes sense to merge and return both of them. However, for other applications, it may be that the largest value is good enough.

Since the security of a system is tied to its correctness, it becomes necessary to capture correctness in the security definitions. However, it isn’t at all obvious how to do so when correctness is a function of numerous factors and each KVS might handle each one of them differently. In response, should we write a different security definition for each KVS, one that captures all its particular design specifics? Unfortunately, because security definitions are written for an object and not for different implementations of an object, this doesn’t make sense. “

In our work, we carefully abstract different components of a general KVS, study the subtle and intricate ways in which they interact, and use these components to define a KVS’s security definition. Equipped with these definitions, we use the techniques developed for encrypted DHTs to show that a balanced KVS is secure.

## My Work: Safely Integrating End-To-End Encryption

As we’ve seen above, with increasing amounts of sensitive data being stored in distributed storage systems and the constant occurrence of data breaches, data security and protection has become a crucial part of these systems. Encrypting data at rest and decrypting it before use isn’t enough, because each decryption exposes the data and increases its likelihood of being stolen. Therefore, the systems we design should be end-to-end encrypted. Pervasive end-to-end encryption may not just protect user data but can also help achieve compliance with legislation like the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). For example, the GDPR’s “right to be forgotten” is easily implemented with encrypted storage because a user can erase their data by simply erasing their encryption key.

My work initiates the study of end-to-end encryption in distributed storage systems. More specifically, we study how to safely integrate end-to-end encryption in distributed hash tables and key value stores, both of which are fundamental to the design of distributed storage systems.

We identify that the key property of a DHT (and also a KVS) that affects its security is its balance, which, roughly speaking, means how evenly it distributes the label/value pairs to its nodes. While load balancing is clearly important for storage efficiency, we see that it also has an impact on security. More precisely, we say that a DHT is $(\varepsilon, \delta, \theta)$-balanced if with probability at least $1- \delta$, for all labels, the probability that any set of $\theta$ corrupted nodes sees the label is bounded by small $\varepsilon$. We then describe a generic way that converts any DHT into an encrypted DHT and prove that if the DHT is $(\varepsilon, \delta, \theta)$-balanced, then the encrypted DHT is $(\varepsilon, \delta, \theta)$-secure. $(\varepsilon, \delta, \theta)$-security, at a high level, means that with probability at least $1- \delta$, the adversary who has corrupted at most θ nodes will learn “something” about at most $\varepsilon$ fraction of label/value pairs in expectation. Moreover, we show that this “something” is very little.

The next natural question we ask is whether there are any known DHTs that are balanced. Interestingly, we not only find that such DHTs exist but that Chord, arguably the most influential DHT, is actually balanced. Analyzing the balance of Chord is non-trivial and relies on solving the following probabilistic question. Suppose that n points are placed independently and uniformly at random on a circle with circumference $1$. Notice that the process induces $n$ arcs. Then, for what values of $\varepsilon, \delta, \theta$,

$$\textrm{Pr}[\textrm{sum of lengths of}\ \theta\ \textrm{largest arcs} ≤ \varepsilon ] ≥ 1 - \delta.$$

Of course, we want to find interesting values for these parameters and not the trivial ones where $\varepsilon = 1$ or $\delta = 0$. Unfortunately, since arc lengths are not independent of each other, Chernoff bounds cannot be directly applied, and advanced probabilistic techniques are employed to show negative dependence between the arc lengths.

## Future Work

We see our work as a first step towards designing provably-secure end-to-end encrypted distributed systems like off-chain networks, distributed file systems and distributed caches. Our work motivates several open problems and directions for future work.

**Beyond Chord.** One immediate direction is to study the balance
of other DHTs, such as Kademlia and Koorde. Because Koorde is similar to Chord
in structure, the bounds we introduce in this work to study Chord’s balance
might find use in analyzing Koorde. Kademlia, on the other hand, has a very
different structure than Chord, so it is likely that new custom techniques and
bounds are needed to analyze its balance.

**New encrypted DHT constructions.** Another direction is to design new encrypted
DHTs, where the adversary learns nothing as opposed to very little. This might
be done, for example, by using more sophisticated techniques from structured
encryption and oblivious RAMs.

**Crypto-friendly DHTs.** A third direction is to design new encryption-friendly
DHTs. By this, we mean designing new DHTs that achieve better balance than
Chord. Of course, this may come at a cost in the DHT’s traditional performance
metrics (for example, load balancing or routing time). Encrypted off-chain
networks. To address the shortcomings of blockchains, much recent effort has
turned to the design of distributed and/or decentralized off-chain storage
networks whose primary purpose is to store large amounts of data with fast
retrieval. Many influential blockchain projects (Ethereum, Enigma, Storj,
Filecoin) rely on off-chain storage, often using DHTs as a core building block
and featuring end-to-end encryption. Our work provides a framework to formally
analyze their security, and an analysis of Ethereum’s Swarm and IPFS may be
particularly interesting. Leveraging our framework, the main challenge left
would be to analyze the balance of the Kademlia DHT, which, as mentioned
earlier, seems non-trivial.