Technical Report for NGI Assure Grant 2021-08-39

Hyper Hyper Space: Conclusions and ideas for building p2p secure data sync

by Santiago Bazerque, Hyper Hyper Space Projectv1.1, September 24th 2024
Summary

The Hyper Hyper Space project aims to build a Byzantine fault tolerant, general purpose data replication system. In practical terms, this means applications using Hyper Hyper Space for data synchronization could connect their storage systems directly to the open Internet. The replication protocol would be aware of both the application's data representation invariants and its data access control rules. Applications can work autonomously on a local replica of the data, greatly simplifying implementation and delegating security and correctness concerns to the provided sync system. Furthermore, standardization of the synchronized data formats could result in greater interoperability and new integration mechanisms. Replication of self-verifiable data could become an alternative to remote procedure calls in APIs.

This report summarizes our conclusions after:

The following sections detail the different design choices and technical alternatives that were explored. They are intended to help guide future development of Hyper Hyper Space and to share this experience more broadly with other developers of synchronization engines and distributed end-user applications.

We provide a detailed description of the data model we're proposing, reference recent publications framing this problem, and justify the trade-offs involved.

Finally, we propose next steps to progress from the new prototype to a production-ready system.

Contents

Design choices New data model Conclusions and recommendations Feedback Funding Acknowledgements References

Design choices

Designing a sync system that affords greater levels of autonomy, resiliency, and privacy while resulting in applications with good ergonomics involves numerous compromises. In this section, we critically survey our design choices.

Platform

The current version of Hyper Hyper Space primarily targets web browsers, using advanced but well-standardized features to run a full synchronization node inside a browser tab. These features include IndexedDB, WebRTC, Web Workers and Web Crypto. We chose this approach because the browser is the most widely deployed secure virtual execution environment available, especially in personal computing environments. Additionally, operating a web browser doesn't require overly technical skills.

In retrospect, this decision has two important drawbacks:

We believe it would be better to adopt an agnostic approach to networking, storage, and crypto primitives, and develop the core sync engine and its protocol independently of platform restrictions. The browser remains an interesting target for the reasons stated above, but not exclusively or as the main platform target.

Byzantine fault tolerance

The general design of the library is heavily influenced by the requirement of having a Byzantine fault tolerant sync protocol, as this implies that all stored data needs to be self-verifiable.

The current version of the library uses a cryptographically secure log to ensure the integrity and completeness of replicated data. The operational semantics implementable over the log are easy to reason about, especially in coordination-free use cases. However, handling coordination in a purely operational setting can be challenging.

This approach also steepens the learning curve for application developers, who need to specify their data integrity and access policy requirements in terms the sync protocol can use automatically.

Despite these challenges, we believe Byzantine fault tolerance is unavoidable for bringing end-user autonomy and data interoperability to multi-user environments. Without it, our use-case would be limited to personal or in-group applications, excluding apps targeting broader society that could provide practical alternatives to large cloud-based platforms. Learning to use the library properly will still be significantly easier than devising secure sync primitives from scratch, and we may provide ready-made adapters for common use cases like user authentication and capability-based systems.

Coordination-free / Local first operation

Autonomous operation, meaning the ability to run an application with intermittent network access or even offline, is one of our core objectives. However, in the absence of coordination, almost any non-trivial application will encounter conflicts. A detailed example using a capability system is presented in the Hyper Hyper Space White Paper. This reality pushes us beyond using Conflict-free Replicated Data Types (CRDTs) exclusively as our modeling building blocks, necessitating some form of conflict resolution layer.

We explored two alternatives:

Both issues (expressivity and ergonomics) are addressed in the new proposed model.

Architecture

Hyper Hyper Space's interface for application developers is a library of replicated data types, covering well-known CRDTs: observed-removed sets, replicable growable arrays, multi-valued and last-writer-wins registers, etc. When these types are combined to form an application's data model, the causal linking / attestation mechanism can provide convergence to a consistent state. All types implement operation-based CRDTs, with operations stored in a database (IndexedDB in the browser or SQLite when running standalone). The database primarily functions as a content-addressable store, saving immutable objects indexed by their hashes. Operations must be fed to a running instance of the data types for use.

Initially, the system re-played all operational histories to reach the current state on startup. This proved too costly, so we implemented periodic saving of the latest state snapshot to reduce start-up time. However, this doesn't help with initial sync, and cold start-up times can still be slightly annoying. The log structure makes it challenging to prioritize replication of data needed earliest.

Applications are expected to compose and extend these types, embedding application rules and logic in a way that makes them available to the replication mechanism. Some application developers expressed that this architectural decision is too invasive and would prefer a clear separation of application behavior and storage.

The transformation of in-memory data structures implementing the replicated data types into Merkle-ized (hash-linked) structures is transparent and automatic. While this offers excellent development ergonomics, it makes synchronization unnecessarily complex. We believe it would be wiser to implement this memory-to-hash-linked translation as a separate, optional layer. This aligns with the idea of using data sync as a narrow waist, following the TCP/IP example. See the White Paper for details.

Data structures

The idea of using a Merkle-DAG as an operational log and implementing CRDTs over it is now well established. See this paper by M. Kleppmann for details and a comprehensive survey. While it has some drawbacks (difficulty of partial secure replication, no way to verifiably refer to materialized state), we have found no other base structure that enables implementing our optimistic conflict resolution solution in a Byzantine environment. In the next section, we discuss how the operational log could be complemented with other structures to overcome these limitations.

New data model

Introduction

To address some of the shortcomings described above, we've implemented a small prototype exploring ideas for a new data model.

First, we analyzed ways to provide Merkle proofs of state properties using Merkle Search Trees. See this paper for definitions and applications to state-based CRDTs. This would allow us to add secure partial replication to the sync protocol.

Second, we've utilized a new conceptual framework that characterizes precisely the problems solvable by coordination-free distributed systems. These are problems admitting solutions modeled as monotonic functions between inputs and outputs. Using this insight, we demonstrate an informal technique to generate a relaxed version of a problem that admits a coordination-free solution, even if the original formulation does not. This relaxed version, implementable using our proposed new data model, can sometimes substitute for the original formulation. In cases where that's unacceptable, it can be elegantly wrapped in a coordination mechanism.

Merkle search trees

We analyzed two candidate data structures for Byzantine fault tolerant synchronization of state:

We found Merkle Search Trees more appropriate for our prototype. Since the Merkle-DAG's logical clock mechanism remains necessary for conflict resolution, we adopted a hybrid approach: we maintained the operational approach but attached Merkle search tree roots to each operation.

We identified two primary uses for state materialization:

Coordination-free conflict resolution

Hyper Hyper Space's current state synchronization protocol works by replicating operational logs and replaying them in each peer to recreate the latest state. Direct state sync (e.g., using the trees discussed above) has slightly weaker verification properties. However, there's a second reason for using a log: conflict resolution.

As operations are appended to the log, they're disseminated by epidemic gossip and should eventually reach all participating peers. The log defines a partial ordering of operations: an operation is greater than all other operations present in the local replica when it was appended. This is sometimes called a causal ordering and provides an effective form of logical clock.

Operations are guaranteed to be processed according to this causal ordering, but concurrently appended operations (in different replicas) will be processed in arrival order (they're not comparable in the causal ordering). Depending on network factors, this may result in different orderings for different peers. When the state comprises solely independent CRDTs, this never results in a state conflict. However, our system allows for dependencies, opening the door to classic race condition problems. These are solved by a coordination-free conflict resolution mechanism that detects races and resolves them deterministically.

It's worth discussing why these state inter-dependencies (that can lead to conflicts) are necessary. Following the approach in the CALM paper, the class of problems solvable by coordination-free distributed algorithms corresponds exactly to those admitting solutions that map inputs to outputs monotonically. The most common approach uses sets. Consider these examples:

A formal treatment of this correspondence can be found in these papers by Ameloot et al. Let's consider more examples to strengthen our intuition. A capability system where capabilities can only be granted is monotonic, thus accepting a coordination-free solution. However, if capabilities can be both granted and revoked, it becomes non-monotonic. In capability systems used in Hyper Hyper Space-based apps (e.g., this wiki with per-user read & write permissions), when a capability is revoked and used concurrently (as per the causal ordering), whatever it was used for (and any dependent operations, transitively) are automatically rolled back.

Another interesting example is this analysis of tree move operations by Kleppmann et al. This problem is non-monotonic, as replicas can diverge if concurrent moves are processed in different orders. Worse, some orderings may produce invalid states (e.g., forming cycles, violating the tree structure). The paper proposes a coordination-free solution where a subset of move operations is selected deterministically, guaranteeing both convergence and consistency. Other operations are ignored (undone if necessary).

In practical terms, we've been finding relaxed versions of the original problems, where conflicting behavior is rejected after the fact. The key insight from building the prototype is a new mechanism to implement this behavior, described below.

A replica will comprise a fixed collection of operational-CRDTs, referred to as objects, each using the secure operational log described earlier. Optionally, operations will have a precondition - a predicate evaluated against other CRDTs' states in the replica before the operation is applied. This evaluation occurs when the operation arrives, and application is conditional on the precondition being satisfied.

As part of its state, each CRDT in the replica will have pointers to positions in the operational logs of CRDTs it depends on for precondition evaluation. These pointers will be moved forward by operations (when new foreign state is detected), as they're part of the object's state as well.

When a pointer moves forward, any concurrent operations (per the causal ordering) will generate non-determinism in the state observed by the rest of the system (depending on whether the pointer-moving operation is received before or after others). To address this, Rule 1 will be applied:

While this restores B's eventual convergence across all replicas, Rule 1 alone is insufficient. If object C moved its state pointer to q before p arrived (admissible since they're causally concurrent), then C's operations may have had preconditions evaluated on q alone, which is stale. Another perspective on this is that some merged states should be considered intermediate and unobservable.

To address this, each pointer will always point to a single operation. Forked states, where several operations have been applied concurrently, must be merged before referencing. When this happens, the merge operation will compute a pointer to the log position just before the first divergence being reconciled occurred. As pointers used for operation precondition computation in other objects move forward, they'll sometimes encounter these merge operations. This triggers Rule 2:

Note that Rule 2 generates more internal states that should not be observed by other objects, as they're not stable across replicas. To address this, we ensure the conflict resolution process is transitive. As with merge operations, when an operation moves a state pointer forward resulting in log reapplication (with potentially different outcomes), we tag the pointer forwarding operation with a reference to the last stable state in the log (unaffected by the state pointer advance). This enables the formulation of our final rule:

Since we require that state dependencies within the replica are acyclic, we're confident that Rule 3 can be applied in a finite number of steps. The combined application of rules 1-3 as state synchronizes guarantees eventual state convergence in every replica.

We'll now see a few examples of how state dependencies and preconditions can be used to model (relaxed versions of) non-monotonic problems.

The capability system described before can be represented using an observed-removed Set to hold <user, capability> pairs, and other structures whose access is controlled by holding specific capabilities would just use preconditions asserting that the appropriate pair is in the set.

The permissions system in Hyper Hyper Space's wiki would be straightforward. It uses a fixed set of owners, a set of moderators changeable only by owners (so far monotonic), and a users set. User admissions/removals and moderation activities must be implemented by operations preconditioned on authorship by identities in the moderators set. Wiki edits are preconditioned on authorship by identities in the user set. This exemplifies the need for the transitivity rule: a moderator can admit a user, the user can edit, and the moderator can be concurrently removed from the moderator set. This resolves by removing the user from the users set and transitively removing their edits from the wiki.

Finally, revisiting the tree move operation analyzed by Kleppmann et al.: Their proposed solution uses an operational log, timestamping tree operations. When operations are appended concurrently to the log and arrive out-of-order, they're undone and reapplied according to timestamps. This linearization ensures eventual convergence of the tree's state and consistent application of cycle-prevention rules. In our proposed model, this could use an append-only linear log to hold tree edit operations (an RGA or any CRDT providing linear order would suffice). Merge operations on the operational log will linearize concurrent additions. A second type would observe the linear log and apply operations to create the tree. When merges occur on the log object, rules 1-2 ensure the tree is recreated using the latest ordering. If other state parts need to operate conditionally on the tree's state, rule 3 would guarantee overall convergence.

Co-transactions

Let's summarize the conflict resolution method described above:

We've explored ways to present this method in a more intuitive way to the application layer. There's a parallel between coordination-free distributed systems and algorithms that operate on infinite data structures, as the absence of coordination prevents the participants in the system from knowing when they have seen all relevant data. This situation is analogous to operating on an infinite data structure. More formally, processing an operational log according to Rules 1-3, and projecting only the states where the log was not on a forked state, would yield a bisimulation between any two causally-compatible operation streams.

We propose an analogy: just as co-induction reasons about infinite structures, we'll define co-transactions as a notion dual to traditional transactions in systems with strong consistency guarantees. Traditional database transactions have straightforward semantics: a transaction makes assumptions about the state (sometimes implicitly by reading portions of it), and the system assures these assumptions hold for the transaction's duration, signaled by a commit statement. If this is not feasible, the system aborts the transaction to preserve consistency.

This schema is impossible in a coordination-free system; there's no way to lock the read state to prevent concurrent edits. However, if state reads could be automatically detected, as in traditional databases, preconditions asserting the observed state could be system-generated, making ensuing modifications conditional on these state parts remaining unmodified concurrently. The next step is clear: use operational replication (with rules 1-3), and reason co-inductively to assert consistency properties of the sequence of generated states. Transactions would be transitively aborted when interfering with each other. How this works in practice needs to be investigated.

Conclusions and recommendations

Feedback

This report has been archived in this GitHub repository. We welcome discussion in the issues section.

You're invited to join Hyper Hyper Space's Discord server for further discussion.

The author can be reached at [email protected].

Funding

This project is funded through NGI Assure, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.

Acknowledgements

This article incorporates corrections and improvements by Micah Fitch and José I. Orlicki, and it builds upon valuable feedback provided by Michiel Leenaars and the strategy team at NLnet. The idea of using Merkle Search Trees to materialize CRDT state was suggested by Martin Kleppmann.

References

Hyper Hyper Space
Byzantine fault tolerance and replicated data types
Consistency in coordination-free systems
Designed using Cavepaint.