[Any text in italics is a direct quote from the paper]
Link to the paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
I had been wanting to read this paper for a while because it has been very influential in the NoSQL databases space. And it has been one of the most successful styles to implement a KV store that provides an acceptable level of durability & eventual consistency at a massive scale. Recently, I have been evaluating DynamoDB for a couple of use-cases at my work, so I decided to read the paper itself. While the current DynamoDB is actually not a direct implementation of this paper, It is still pretty close to the tradeoffs made in this paper.
Amazon did have a few considerations & constraints that they want to put on the system right off the bat:
- They only needed a KV store. Data operations only span a single key, so table joins, etc are not required.
- It must run in commodity cloud hardware, where machines can be of different types & sizes. This makes sense economically because you will want to eventually replace some of your machines with new machines. It's in contrast to Google's Spanner database, which requires atomic clocks (which it uses well to provide strong consistency, but that's a different point).
- It should never reject writes, whether it is due to unavailable nodes, or concurrent updates. This perhaps tells us a lot about Amazon's early tech principles: customer (read: money) is really the king. They don't want to take even the slightest of downtime because that directly translates to losing revenue, and potentially customer trust.
- It should be decentralized. They don't want to run into problems with a bottleneck server at a massive scale.
Important design tradeoffs:
Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities.
The database should be able to run within different sizes of machines. Dynamo solves this by creating the concept of “virtual nodes”. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. So a large machine will potentially host multiple nodes.
It should be able to scale out one node at a time, with minimal impact. Dynamo uses consistent hashing across several nodes. Consistent hashing has a benefit that adding or removing a node only triggers rebalance in the adjacent nodes, not the whole cluster. Dynamo also uses the now widely accepted form of consistent hashing where each node gets assigned multiple points from the consistent hashing ring, instead of a large contiguous chunk.
Dynamo supports replication to a configurable number of nodes. A node handling a read or write operation is known as the “coordinator”. Each key is assigned a coordinator node, which replicates it to n-1 clockwise successor nodes. As for which nodes contain a particular key, each node can calculate which nodes should contain the particular key.
Data Versioning & Conflict resolution
Since Dynamo provides eventual consistency, you have to keep track of versioning of data to be able to resolve conflicts. Keep in mind that the write operation to dynamo key is propagated to all its replicas asynchronously. Dynamo expects applications to acknowledge multiple versions of the same data. Dynamo also uses vector clocks in an attempt to provide logical ordering between the versions. And because logical ordering cannot be provided in cases like network failure etc, with concurrent writes you might end up with multiple versions of an object.
We already know that Dynamo has weak consistency by default. The implementation offering by AWS does have an option to make all reads strongly consistent. It requires configuring two parameters: R & W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. If you set R + W > N, you have a quorum-like system. But In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency. Upon receiving a put() request for a key, the coordinator generates the vector clock for the new version and writes the new version locally. The coordinator then sends the new version to the N highest-ranked reachable nodes. If at least W-1 nodes respond then the write is considered success. Similarly, for a get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client. If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back
The coordinator node can also “read repair” a node with latest data version if the version returned by it during the quorum read operation is determined to be stale.
Since most write operations are preceded by a read, Dynamo also keeps track of which coordinator node replied the fastest to the read operation & route the write request to the same node, thus increasing the chances of a clean logically ordered versioning for that key. This also increases chances of “read your writes” consistency.
For temporary failures, Dynamo uses a “sloppy quorum”; all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring. Writes might go to the next available node in case a node is down for which a write operation was intended. In this case, the new write contains a metadata hint that it was intended originally for a node that was down and hence had to be written to a different node. Such hinted objects are scanned periodically and will be sent to their originally intended node once it comes back up.
For permanent failures, such as the hinted replicas go down before they are replicated back to the original node, Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized. The central idea is that replicas can use each other to synchronize using Merkle trees (basically it’s a hierarchical checksum). Merkle trees can drill down to the smallest parts of the data set which differ between two nodes, and hence only a small chunk of data needs to be replicated back & forth for the synchronization.
Ring Membership & Failure Detection
Ring membership addition/removal is not triggered automatically because node failures are generally temporary & short-lived (in AWS). So the membership change has to be made manually which gets propagated to other nodes using a gossip-based mechanism. Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories. As for failure detection for writes, each node maintains a local list of responsive & unresponsive nodes.
Storage engines & Durability
Dynamo uses multiple storage engines, to serve different workloads (this is not client configurable). The main storage engine is Berkeley Database (BDB) Transactional Data Store, and MySQL & in-memory buffer are available as alternative storage.
To decrease the 99.9th percentile, Dynamo employs an in-memory buffer for writes, which is flushed to disk by a writer thread periodically. This trades durability for improved latency. To reduce the durability risk, the write operation is refined to have the coordinator choose one out of the N replicas to perform a “durable write”.
- It’s optimized for writes performance, which is the main goal Amazon seeks to solve, other than scaling up & down.
- At least the paper doesn’t seem to take a stance at a default conflict resolution algorithm, but the DynamoDB documentation hints that “the last write wins” is the default mechanism. However, now they also stronger support isolation levels like Serializable
- DynamoDB currently provides transactions and related isolation levels: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/transactions.html. It wasn't the case until 2018 if you don't count the client library by AWS.
- No native support for compression. you can choose to implement one at the client level.
- There seems to be no major bottleneck in scaling. Dynamo has to be one of the most scalable systems if you believe in such comparisons. I say so because all nodes are symmetric by design.