ZooKeeper

Paper: https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

What questions does this paper shed light on?

  1. 我们是否能将 Raft 中提到的服务间合作封装成一个通用的服务?

    如果可以,API 应该设计成什么样?其他的分布式系统应该怎么使用这个服务?

  2. 我们在一个分布式系统总投入了 N 倍的机器,能够得到 N 倍的性能提升?

Performance

Raft:在添加了更多的 replicas 之后,因为 Leader 需要等待响应的机器增多,反而会降低性能。ZooKeeper 提高性能的一个基本思想是:

  • 将 Read 负载分散到各个 Replicas 机器中,使得读性能能够随机器数量线性地提升;
  • 但是在传统的 Raft 架构下的,直接读取 Replicas 会遇到一些问题:
    1. Replica may not be in majority, so may not have seen a completed write;
    2. Replica may not yet have seen a commit for a completed write;
    3. Replica may be entirely cut off from the leader;

How does ZooKeeper skin this cat?

  • By changing the definition of correctness: It allows reads to yield stale data, but otherwise preserves order.

How does ZooKeeper guarantees ordering?

  • Linearizable writes:
    1. Clients only send writes to the leader;
    2. The leader chooses an order (numbered by zxid) responses to client;
    3. Leader must preserve client write order (zxid) across leader failure, Replicas must enforce “client’s reads never go backwards in zxid order”;
  • FIFO client order:
    1. Client’s successive reads execute at non-decreasing points in the order. Each read request will carry previous zxid write order;
    2. Server may block a client’s read to wait for previous write, or sync; (If the most recent zxid that replica sees happens before the zxid client assign)

简单地说,ZooKeeper 通过以下方式显著提高了读性能:

  • 丢失“读”的一致性,但是保证了单客户端“读写”一致性(也就是说如果是我之前“写”的内容,我一定可以“读”到);
  • 这一一致性的损失其实是可以接受的,因为在大部分的系统中,我一般只关心自己的内容;

Other performance tricks:

  1. Client can send async writes to leader;
  2. Leader batches up many requests to reduce net and disk-write overhead;
  3. Fuzzy snapshots (and idempotent updates) so snapshot doesn’t stop writes;

性能分析:

  • 高“读”吞吐:“读”性能随机器数量增加而增加;
  • 低“写”吞吐:“写”性能随机器数量增加而降低(21000 次写/秒);

General-purpose Coordination Service

API

ZooKeeper API Overview:

  • States: A file-system-like tree of znodes consist of: file names/file content/directories/path names/version number;
  • Types of znodes: Regular, ephemeral, sequential;

Operations on znodes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
create(path, data, flags)
# Only first create indicates success

delete(path, version)
# delete znode if znode(path).version = version

exists(path, watch)
# set watch=true, client will receive notification if path is later created/deleted

getData(path, watch)
# same as above

setData(path, data, version)
# only set data if znode(path).version = version

getChildren(path, watch)
# get all children node, example: /x/y/z is a child node of /x/y/

sync()
# sync then read, ensures writes before sync are visible to same client's read

Example

Example1: Increment a number stored in ZooKeeper znode:

  • 需要考虑的情况:read return stale data; another client concurrently writes value。pseudocode:

    1
    2
    3
    4
    
    while true:
        x, v := getData("f")
        if setData(x + 1, version=v):
          break
    

Example2: Simple Locks:

  • Pseudocode:

    1
    2
    3
    4
    5
    6
    7
    
      acquire():
        while true:
          if create("lf", ephemeral=true): return success
          if exists("lf", watch=true): wait for notification
    
      release(): (voluntarily or session timeout)
        delete("lf")
    

Example3: Locks without Herd Effect:

  • 上面简单锁的问题:当存在大量客户端竞争同一个锁资源时,会导致锁竞争问题:每次锁的释放都有大量客户端发起竞争请求,但是实际却只有一个客户端得到了锁资源;

  • Acquire Pseudocode:

    1
    2
    3
    4
    5
    
    create a "sequential" file
    while true:
    	list files
    	if no lower-numbered: return lock is acquired!
      if exists(next-lower-numbered, watch=true): wait for event...
    

这类使用 ephemeralsequential 的锁被称为 Soft Lock,因为在客户端长时间未响应时,ZooKeeper 会主动删除文件,它适用于可重复贡献问题(计算两次是可以被接受的)。

如果要使用原子锁,应该使用 ready 文件执行最小事务(删除后再添加表示执行结束)。

More

ZooKeeper is a successful design.

Topics not covered:

  1. Persistence;
  2. Details of batching and pipelining for performance;
  3. Fuzzy snapshots;
  4. Idempotent operations;
  5. Duplicate client request detection;