Paper: https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf
What questions does this paper shed light on?
我们是否能将 Raft 中提到的服务间合作封装成一个通用的服务?
如果可以,API 应该设计成什么样?其他的分布式系统应该怎么使用这个服务?
我们在一个分布式系统总投入了 N 倍的机器,能够得到 N 倍的性能提升?
Performance
Raft:在添加了更多的 replicas 之后,因为 Leader 需要等待响应的机器增多,反而会降低性能。ZooKeeper 提高性能的一个基本思想是:
- 将 Read 负载分散到各个 Replicas 机器中,使得读性能能够随机器数量线性地提升;
- 但是在传统的 Raft 架构下的,直接读取 Replicas 会遇到一些问题:
- Replica may not be in majority, so may not have seen a completed write;
- Replica may not yet have seen a commit for a completed write;
- Replica may be entirely cut off from the leader;
How does ZooKeeper skin this cat?
- By changing the definition of correctness: It allows reads to yield stale data, but otherwise preserves order.
How does ZooKeeper guarantees ordering?
- Linearizable writes:
- Clients only send writes to the leader;
- The leader chooses an order (numbered by
zxid
) responses to client; Leader
must preserve client write order (zxid
) across leader failure,Replicas
must enforce “client’s reads never go backwards inzxid
order”;
- FIFO client order:
- Client’s successive reads execute at non-decreasing points in the order. Each read request will carry previous
zxid
write order; - Server may block a client’s read to wait for previous write, or sync; (If the most recent
zxid
that replica sees happens before thezxid
client assign)
- Client’s successive reads execute at non-decreasing points in the order. Each read request will carry previous
简单地说,ZooKeeper 通过以下方式显著提高了读性能:
- 丢失“读”的一致性,但是保证了单客户端“读写”一致性(也就是说如果是我之前“写”的内容,我一定可以“读”到);
- 这一一致性的损失其实是可以接受的,因为在大部分的系统中,我一般只关心自己的内容;
Other performance tricks:
- Client can send async writes to leader;
- Leader batches up many requests to reduce net and disk-write overhead;
- Fuzzy snapshots (and idempotent updates) so snapshot doesn’t stop writes;
性能分析:
- 高“读”吞吐:“读”性能随机器数量增加而增加;
- 低“写”吞吐:“写”性能随机器数量增加而降低(21000 次写/秒);
General-purpose Coordination Service
API
ZooKeeper API Overview:
- States: A file-system-like tree of
znodes
consist of:file names
/file content
/directories
/path names
/version number
; - Types of
znodes
:Regular
,ephemeral
,sequential
;
Operations on znodes
:
|
|
Example
Example1: Increment a number stored in ZooKeeper znode
:
需要考虑的情况:
read
return stale data; another client concurrentlywrites
value。pseudocode:1 2 3 4
while true: x, v := getData("f") if setData(x + 1, version=v): break
Example2: Simple Locks:
Pseudocode:
1 2 3 4 5 6 7
acquire(): while true: if create("lf", ephemeral=true): return success if exists("lf", watch=true): wait for notification release(): (voluntarily or session timeout) delete("lf")
Example3: Locks without Herd Effect:
上面简单锁的问题:当存在大量客户端竞争同一个锁资源时,会导致锁竞争问题:每次锁的释放都有大量客户端发起竞争请求,但是实际却只有一个客户端得到了锁资源;
Acquire Pseudocode:
1 2 3 4 5
create a "sequential" file while true: list files if no lower-numbered: return lock is acquired! if exists(next-lower-numbered, watch=true): wait for event...
这类使用 ephemeral
或 sequential
的锁被称为 Soft Lock
,因为在客户端长时间未响应时,ZooKeeper 会主动删除文件,它适用于可重复贡献问题(计算两次是可以被接受的)。
如果要使用原子锁,应该使用 ready
文件执行最小事务(删除后再添加表示执行结束)。
More
ZooKeeper is a successful design.
Topics not covered:
- Persistence;
- Details of batching and pipelining for performance;
- Fuzzy snapshots;
- Idempotent operations;
- Duplicate client request detection;