ZooKeeper

Paper: https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

What questions does this paper shed light on?

我们是否能将 Raft 中提到的服务间合作封装成一个通用的服务？
如果可以，API 应该设计成什么样？其他的分布式系统应该怎么使用这个服务？
我们在一个分布式系统总投入了 N 倍的机器，能够得到 N 倍的性能提升？

Performance

Raft：在添加了更多的 replicas 之后，因为 Leader 需要等待响应的机器增多，反而会降低性能。ZooKeeper 提高性能的一个基本思想是：

将 Read 负载分散到各个 Replicas 机器中，使得读性能能够随机器数量线性地提升；
但是在传统的 Raft 架构下的，直接读取 Replicas 会遇到一些问题：
1. Replica may not be in majority, so may not have seen a completed write;
2. Replica may not yet have seen a commit for a completed write;
3. Replica may be entirely cut off from the leader;

How does ZooKeeper skin this cat?

By changing the definition of correctness: It allows reads to yield stale data, but otherwise preserves order.

How does ZooKeeper guarantees ordering?

Linearizable writes:
1. Clients only send writes to the leader;
2. The leader chooses an order (numbered by zxid) responses to client;
3. Leader must preserve client write order (zxid) across leader failure, Replicas must enforce “client’s reads never go backwards in zxid order”;
FIFO client order:
1. Client’s successive reads execute at non-decreasing points in the order. Each read request will carry previous zxid write order;
2. Server may block a client’s read to wait for previous write, or sync; (If the most recent zxid that replica sees happens before the zxid client assign)

简单地说，ZooKeeper 通过以下方式显著提高了读性能：

丢失“读”的一致性，但是保证了单客户端“读写”一致性（也就是说如果是我之前“写”的内容，我一定可以“读”到）；
这一一致性的损失其实是可以接受的，因为在大部分的系统中，我一般只关心自己的内容；

Other performance tricks:

Client can send async writes to leader;
Leader batches up many requests to reduce net and disk-write overhead;
Fuzzy snapshots (and idempotent updates) so snapshot doesn’t stop writes;

性能分析：

高“读”吞吐：“读”性能随机器数量增加而增加；
低“写”吞吐：“写”性能随机器数量增加而降低（21000 次写/秒）；

General-purpose Coordination Service

API

ZooKeeper API Overview：

States: A file-system-like tree of znodes consist of: file names/file content/directories/path names/version number;
Types of znodes: Regular, ephemeral, sequential;

Operations on znodes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
create(path, data, flags)
# Only first create indicates success

delete(path, version)
# delete znode if znode(path).version = version

exists(path, watch)
# set watch=true, client will receive notification if path is later created/deleted

getData(path, watch)
# same as above

setData(path, data, version)
# only set data if znode(path).version = version

getChildren(path, watch)
# get all children node, example: /x/y/z is a child node of /x/y/

sync()
# sync then read, ensures writes before sync are visible to same client's read

Example

Example1: Increment a number stored in ZooKeeper znode:

需要考虑的情况：read return stale data; another client concurrently writes value。pseudocode：

1
2
3
4
while true:
    x, v := getData("f")
    if setData(x + 1, version=v):
      break

Example2: Simple Locks:

Pseudocode:

1
2
3
4
5
6
7
  acquire():
    while true:
      if create("lf", ephemeral=true): return success
      if exists("lf", watch=true): wait for notification

  release(): (voluntarily or session timeout)
    delete("lf")

Example3: Locks without Herd Effect:

上面简单锁的问题：当存在大量客户端竞争同一个锁资源时，会导致锁竞争问题：每次锁的释放都有大量客户端发起竞争请求，但是实际却只有一个客户端得到了锁资源；

Acquire Pseudocode:

1
2
3
4
5
create a "sequential" file
while true:
	list files
	if no lower-numbered: return lock is acquired!
  if exists(next-lower-numbered, watch=true): wait for event...

这类使用 ephemeral 或 sequential 的锁被称为 Soft Lock，因为在客户端长时间未响应时，ZooKeeper 会主动删除文件，它适用于可重复贡献问题（计算两次是可以被接受的）。

如果要使用原子锁，应该使用 ready 文件执行最小事务（删除后再添加表示执行结束）。

ZooKeeper is a successful design.

Topics not covered:

Persistence;
Details of batching and pipelining for performance;
Fuzzy snapshots;
Idempotent operations;
Duplicate client request detection;

ZooKeeper

Performance

General-purpose Coordination Service

API

Example

More