Paper: https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf
What questions does this paper shed light on?
- 我们是否能将 Raft 中提到的服务间合作封装成一个通用的服务? - 如果可以,API 应该设计成什么样?其他的分布式系统应该怎么使用这个服务? 
- 我们在一个分布式系统总投入了 N 倍的机器,能够得到 N 倍的性能提升? 
Performance
Raft:在添加了更多的 replicas 之后,因为 Leader 需要等待响应的机器增多,反而会降低性能。ZooKeeper 提高性能的一个基本思想是:
- 将 Read 负载分散到各个 Replicas 机器中,使得读性能能够随机器数量线性地提升;
- 但是在传统的 Raft 架构下的,直接读取 Replicas 会遇到一些问题:- Replica may not be in majority, so may not have seen a completed write;
- Replica may not yet have seen a commit for a completed write;
- Replica may be entirely cut off from the leader;
 
How does ZooKeeper skin this cat?
- By changing the definition of correctness: It allows reads to yield stale data, but otherwise preserves order.
How does ZooKeeper guarantees ordering?
- Linearizable writes:- Clients only send writes to the leader;
- The leader chooses an order (numbered by zxid) responses to client;
- Leadermust preserve client write order (- zxid) across leader failure,- Replicasmust enforce “client’s reads never go backwards in- zxidorder”;
 
- FIFO client order:- Client’s successive reads execute at non-decreasing points in the order. Each read request will carry previous zxidwrite order;
- Server may block a client’s read to wait for previous write, or sync; (If the most recent zxidthat replica sees happens before thezxidclient assign)
 
- Client’s successive reads execute at non-decreasing points in the order. Each read request will carry previous 
简单地说,ZooKeeper 通过以下方式显著提高了读性能:
- 丢失“读”的一致性,但是保证了单客户端“读写”一致性(也就是说如果是我之前“写”的内容,我一定可以“读”到);
- 这一一致性的损失其实是可以接受的,因为在大部分的系统中,我一般只关心自己的内容;
Other performance tricks:
- Client can send async writes to leader;
- Leader batches up many requests to reduce net and disk-write overhead;
- Fuzzy snapshots (and idempotent updates) so snapshot doesn’t stop writes;
性能分析:
- 高“读”吞吐:“读”性能随机器数量增加而增加;
- 低“写”吞吐:“写”性能随机器数量增加而降低(21000 次写/秒);
General-purpose Coordination Service
API
ZooKeeper API Overview:
- States: A file-system-like tree of znodesconsist of:file names/file content/directories/path names/version number;
- Types of znodes:Regular,ephemeral,sequential;
Operations on znodes:
|  |  | 
Example
Example1: Increment a number stored in ZooKeeper znode:
- 需要考虑的情况: - readreturn stale data; another client concurrently- writesvalue。pseudocode:- 1 2 3 4- while true: x, v := getData("f") if setData(x + 1, version=v): break
Example2: Simple Locks:
- Pseudocode: - 1 2 3 4 5 6 7- acquire(): while true: if create("lf", ephemeral=true): return success if exists("lf", watch=true): wait for notification release(): (voluntarily or session timeout) delete("lf")
Example3: Locks without Herd Effect:
- 上面简单锁的问题:当存在大量客户端竞争同一个锁资源时,会导致锁竞争问题:每次锁的释放都有大量客户端发起竞争请求,但是实际却只有一个客户端得到了锁资源; 
- Acquire Pseudocode: - 1 2 3 4 5- create a "sequential" file while true: list files if no lower-numbered: return lock is acquired! if exists(next-lower-numbered, watch=true): wait for event...
这类使用 ephemeral 或 sequential 的锁被称为 Soft Lock,因为在客户端长时间未响应时,ZooKeeper 会主动删除文件,它适用于可重复贡献问题(计算两次是可以被接受的)。
如果要使用原子锁,应该使用 ready 文件执行最小事务(删除后再添加表示执行结束)。
More
ZooKeeper is a successful design.
Topics not covered:
- Persistence;
- Details of batching and pipelining for performance;
- Fuzzy snapshots;
- Idempotent operations;
- Duplicate client request detection;