Clusion v0.2.0

by Seny Kamara and Tarik Moataz

Over the last two years, we have been working on an open source encrypted search library called Clusion. We are using it in several ESL projects, including the Signal Search project we recently released. We are also excited that other groups are experimenting with it. In this post, we wanted to give an overview of Clusion and answer some of the most frequently asked questions we have received about it.

Overview

Clusion is an open source encrypted search library created and maintained by the Encrypted Systems Lab at Brown University. It is written in Java and is available under the GNU General Public License v3. Clusion is designed to be easy to use and modular. It can be used for both fast prototyping and integration into real-world applications. It implements state-of-the-art and highly-efficient searchable symmetric encryption (SSE) schemes with varying levels of expressiveness. This includes schemes that handle single, conjunctive, disjunctive and boolean keyword search.

What is encrypted search? Encrypted search is an area of cryptography that focuses on the design of cryptographic primitives to search on encrypted data. There are a wide range of possible solutions to this problem including fully-homomorphic encryption (FHE), oblivious RAM (ORAM), multi-party computation (MPC), property-preserving encryption (PPE) and structured encryption (STE). Each of these approaches provide different tradeoffs between efficiency, expressiveness and security. For an introduction to the area, see this paper and these lecture notes.

What is searchable symmetric encryption (SSE)? One approach to designing encrypted search solutions is to use structured encryption (STE). STE schemes encrypt data structures in such a way that they can be queried privately. SSE is a special case of STE that focuses on encrypting search structures, e.g., inverted indices, search trees etc.

How did Clusion come about? The idea of building Clusion started in 2015 in a bus ride heading to Saint Annes in Rennes, France. We were discussing the scarcity of encrypted search implementations and the complete lack of open source libraries. At that time, there were some implementations from Microsoft, IBM and Columbia/Bell Labs but none were open source. This was surprising because it had been almost ten years since we had practical techniques to search on encrypted data. The issue became even more pressing when we wanted to implement a new boolean SSE scheme we were working on called IEX, which relied on several building blocks and techniques that had never been implemented. So we decided to implement IEX, along with its underlying building blocks, and to make everything open source. Our hope was that an open source SSE library would help others who wanted to evaluate their constructions and would establish a common framework that researchers and practitioners could refer to and improve on.

Who is Clusion for? We designed Clusion for anyone who is interested in using or experimenting with encrypted search technology. It has already been used in several ESL projects by undergraduate, master's and PhD students.

What kind of applications can use Clusion? Clusion can be used by any application that needs to store and query sensitive data. We use it to evaluate our constructions, to build prototypes and are using it to build a system that will be deployed at scale.

Some examples of applications include cloud backup services like Dropbox, iCloud, Google Drive and OneDrive; and NoSQL databases like MongoDB and Redis. Encrypted search is not only applicable in cloud scenarios, however, but can also be used by desktop and mobile applications when storing sensitive data locally. This was the case, for example, in a recent project where we used Clusion to add a secure message search feature to the Signal app.

An important note on security. While SSE, and encrypted search technologies in general, make the design of secure and privacy-preserving applications easier, it is important to understand their inherent limitations. For starters, encryption technology as a whole can only protect data against certain kinds of threats and provides no guarantees against attacks that target users or input/output devices. Second, it is important to keep in mind that the security of your system may not reduce directly to the security of the underlying SSE scheme. In other words, the proper use and implementation of the SSE scheme may be useless if the larger system in which it is used is not designed or implemented properly. Finally, it is crucial to understand that most encrypted search solutions leak some information. In some cases like ORAM-based solutions this leakage can be very small. In others, like PPE-based solutions this leakage can be very large. SSE solutions are somewhere in the middle: more leakage than ORAM-based solutions but less than PPE-based solutions. A lot of ongoing research is trying to understand the practical impact of this leakage so deciding how and when to properly use these schemes can be tricky. For these reasons, we recommend that any real-world deployments of encrypted search technology be accompanied by a thorough review by experts in the field.

Implementation

The Clusion architecture. Clusion is built in a modular fashion and is split into several components, as shown in Fig. 1. It handles the entire encrypted search pipeline including the parsing and indexing of data, the creation of encrypted data structures, the encrypted query operations and, finally, displaying the result to the user. We give more details about the components below.

Fig. 1: The Clusion components.

Indexing. The Clusion indexer takes as input a folder and indexes the files in it. It can index pdf files, various Microsoft file formats (e.g., .doc and .ppt), media files (e.g., pictures and videos) as well as raw text such as html and text files. The different file formats are handled using PDFBox and POI. Tokenization and stemming is done using Apache Lucene. The indexer outputs two multi-maps. The first associates keywords to document filenames and the second associates filenames to keywords. The default indexer does full-text indexing but it can be modified to index only a user-defined set of keywords. The underlying data structures are built with Google Guava.

Cryptographic primitives. We use Bouncy Castle to instantiate all the underlying cryptographic primitives. In particular, we use AES-CTR for symmetric encryption and HMAC-SHA256/512 and AES-CMAC for PRFs. Key generation is based on PBE PKCS1 and we use SecureRandom as a PRG. We also provide a encrypt-then-mac authenticated encryption and a deterministic version of AES (AES with a synthetic IV). In addition, there is also an implementation of the HCB1 online cipher.

SSE schemes. As of v0.2.0, Clusion includes the following constructions:

  • 2Lev: a single-keyword SSE scheme by [CJJJKRS14]. 2Lev has optimal search complexity and is designed to be I/O-efficient so it can be disk-resident.

  • Dynamic 2Lev: a dynamic variant of 2Lev that handles additions of new documents. Here, we modified the original dynamic construction of [CJJJKRS14] to be forward-private. Note that this variant does not handle delete operations.

  • Dynamic pibase: pibase is a single-keyword SSE scheme from [CJJJKRS14]. It is has optimal search complexity and is designed to be memory-resident. The dynamic variant of pibase handles additions and deletions but search complexity is linear in the number of delete operations which means it should only be used in settings where few deletes occur. We also modified this construction to be forward-private.

  • ZMF: a compact single-keyword SSE scheme from [KM17]. This construction has linear search complexity but produces very small encrypted indexes. It should be used when space is at a premium.

  • IEX: a (purely) disjunctive SSE scheme from [KM17]. IEX has sub-linear search complexity and optimal communication complexity. IEX makes black-box use of an underlying single-keyword SSE scheme, which means that it can be instantiated using any single-keyword SSE scheme. The two instantiations implemented in Clusion are IEX-2Lev and IEX-ZMF. The former is optimized for search; achieving fast search at the cost of a large encrypted index. The latter is optimized for space; producing a compact index at the cost of slower search (note that IEX-ZMF still has sub-linear search complexity).

  • BIEX: a boolean SSE scheme from [KM17]. Like IEX, BIEX also makes black-box use of a single-keyword SSE scheme. The two instantiations implemented in Clusion are BIEX-2Lev and BIEX-ZMF, providing the same tradeoffs between search and space discussed above.

Test code. Clusion includes test code that can serve as examples of how to use its various constructions. All the tests currently run in memory and all the algorithms run on the same machine. That is, the tests do not implement any client/server scenarios. This is because Clusion currently does not include a client/server infrastructure (this is something we are working on and will be adding soon).

Future Work

There are many ways in which Clusion can be improved and we are currently working on several new features. Here are a few that we plan to release in upcoming versions:

  • Client/server infrastructure: as stated earlier, Clusion does not currently have any client/server code. This is a limitation for usage in cloud scenarios where the client encrypts its data before sending it to a cloud provider and the latter executes the search. We are working on this feature and will release something soon.

  • I/O-efficient encrypted indexes: while the 2Lev construction is designed to be I/O-efficient, this is not the case for the rest of the constructions implemented in Clusion. We are currently experimenting with ways to make non-I/O-efficient constructions more I/O-efficient in practice.

Clusion is an open source library and, as for any open source project, we hope that the community will use it and help improve it. If you are interested in helping or if you are using Clusion for a project, please let us know.