Raft

Paper: https://raft.github.io/raft.pdf

Split Brain

In the systems studied so far:

  • MapReduce: replicates computation but relies on a single master to organize it;
  • GFS: replicates data but relies on the master to pick primaries;
  • VMware FT: replicates the service but relies on test-and-set to pick the primary;

All of these systems depend on a single machine for their critical decisions; that single machine is what lets them avoid the Split Brain problem.

Why does the Split Brain problem arise?

  • The root cause is that computers cannot tell the difference between a crashed server and a broken network: the symptoms look identical;
  • For example: machine A in a system can no longer reach machine B. If A assumes B has crashed when it is really a network failure (and B assumes the same about A), A and B split into two independent services, each believing the other is down while continuing to serve client requests;

“Split Brain” caused by “network partition” seemed insurmountable for a long time:

  • Previous solutions:

    1. An outside agent (a human) decides when to cut over;
    2. A single perfectly reliable server (as in the previous chapter);
    3. A perfectly reliable network, so that a lack of response can only mean the server crashed;
  • But all of these are single points of failure (not desirable);

Raft copes with partition by “Majority Vote”. The majority threshold is not a tunable 50% or an arbitrary number; it embodies several subtle ideas:

  • At most one partition can have a majority (e.g., 2f+1 servers can tolerate f failed servers; see the sketch after this list);
  • Any two majorities must intersect in at least one server (pigeonhole principle);
  • So any command committed by an earlier majority is known to at least one server of any later majority (again the pigeonhole principle);
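
A minimal sketch of the majority arithmetic (the function name is illustrative, not from the notes): with n = 2f+1 servers a majority needs f+1 votes, so f failures can be tolerated and any two majorities share at least one server.

```go
package raft

// majority returns the number of votes needed out of n servers.
// With n = 2f+1 this is f+1, so two majorities always overlap.
func majority(n int) int {
	return n/2 + 1
}
```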

Before Raft, two partition-tolerant replication schemes were invented around 1990:

  • Paxos and View-Stamped Replication

Overview

Time diagram of how a Raft system handles a client command:

  • Client:

    • Clients only interact with the Leader, as long as the Leader stays up;
    • Clients can’t see Follower states or logs;
    • Reason: the distributed system should mimic a single server;
  • Leader:

    1. Client sends a command to the Leader;
    2. Leader appends the command to its log;
    3. Leader sends AppendEntries RPCs to the Followers (message format sketched after these lists);
    4. Leader waits until a bare majority (including itself) has replied; the entry is then committed;
    5. Leader executes the command and replies to the client;
    6. Commit info is piggybacked to Followers in subsequent AppendEntries RPCs;
  • Followers:

    1. Follower adds the command to its log once it receives the AppendEntries RPC;
    2. Follower executes the command only once it knows the Leader has committed it;
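
A sketch of the AppendEntries message that carries these steps; the field names follow Figure 2 of the Raft paper, while the surrounding Go types are illustrative:

```go
package raft

// LogEntry holds one client command plus the term in which the Leader received it.
type LogEntry struct {
	Term    int
	Command interface{}
}

// AppendEntriesArgs follows Figure 2 of the Raft paper.
type AppendEntriesArgs struct {
	Term         int        // Leader's term
	LeaderId     int
	PrevLogIndex int        // index of the log entry immediately preceding Entries
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int        // Leader's commitIndex; tells Followers what is safe to execute
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}
```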

Why the logs?

  • The log orders the commands, to ensure replicas agree on a single execution order;
  • In case a Follower misses an entry: the Leader must re-send it;
  • In case a machine crashes: persisted commands can be replayed after reboot;

Some replicas may lag behind in their logs, but they will eventually converge to be identical.

Leader Election

Leader ensures all replicas execute the same commands in the same order.

Each Leader is marked by a term number; any term has at most one Leader.

When election start?

  • A Follower that doesn’t hear from the current Leader for an “election timeout” increments its local currentTerm, becomes a Candidate, and tries to collect votes;

What happens if an election doesn’t succeed?

  • Wait for another “election timeout”; a new election with a higher term takes precedence.

Split votes problem:

  • Problem:
    • If all Candidates hit their “election timeout” at the same time, every one of them votes for itself, and the votes can keep splitting so that the system never elects a new Leader;
  • Solution:
    • Each server picks a random election timeout; whoever draws the lowest delay times out first and usually wins (randomized delays are a common pattern in network protocols); see the sketch after this list;
    • Minimum timeout: at least a few heartbeat intervals, to avoid needless elections;
    • Minimum spread between timeouts: long enough for one Candidate to finish its election before the next one times out;
    • Maximum timeout: bounds how long the system pauses after a failure, so it should be short enough to react quickly;
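
A minimal sketch of the randomized election timeout; the concrete durations are assumptions, chosen only to be several heartbeat intervals long and spread out enough for one Candidate to win before another times out:

```go
package raft

import (
	"math/rand"
	"time"
)

const (
	heartbeatInterval  = 100 * time.Millisecond // how often the Leader sends AppendEntries
	electionTimeoutMin = 300 * time.Millisecond // a few heartbeat intervals
	electionTimeoutMax = 600 * time.Millisecond // bounds how long the system may pause
)

// randomElectionTimeout spreads servers out so their timeouts rarely collide.
func randomElectionTimeout() time.Duration {
	spread := electionTimeoutMax - electionTimeoutMin
	return electionTimeoutMin + time.Duration(rand.Int63n(int64(spread)))
}
```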

Raft Log

Storage format:

  • Each server maintains its own log of operations; the content of each entry is the operation itself;
  • Each entry occupies a monotonically increasing log slot and also records the term in which it was created (which identifies the Leader that handled it);

What does a newly elected Leader do when it finds the logs are inconsistent?

  • Raft forces agreement by having Followers adopt the new Leader’s log;
  • Concretely:
    1. Every AppendEntries RPC from the Leader carries the prevLogIndex and prevLogTerm of the entry preceding the new ones;
    2. A Follower responds false when its log does not match at that prevLog position;
    3. On a failed response the Leader decrements prev and resends, looping until the Follower responds true;
    4. Once the Follower matches at prevLog, it deletes every entry after that point and accepts the Leader’s entries;
  • The above is a simplified strategy; clearly fewer round trips would let us roll back quicker (see the sketch after this list);
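
A sketch of the Follower-side consistency check, reusing the assumed types from the earlier AppendEntries sketch and assuming log index 0 holds a placeholder so real entries start at index 1 as in the paper:

```go
package raft

type Raft struct {
	currentTerm int
	log         []LogEntry // log[0] is a placeholder; real entries start at index 1
}

func (rf *Raft) handleAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = rf.currentTerm
	if args.Term < rf.currentTerm {
		reply.Success = false // stale Leader
		return
	}
	// Reject if we have no entry at PrevLogIndex, or its term does not match.
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}
	// Matched: overwrite everything after PrevLogIndex with the Leader's entries.
	// (A simplification; the paper only truncates on an actual conflict.)
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)
	reply.Success = true
}
```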

How do we know the Follower’s surplus log entries are safe to delete? This is equivalent to requiring that the Leader’s log is complete:

  • The election rule guarantees the Leader’s log is complete: a server grants its RequestVote to a Candidate only if:
    1. The Candidate has a higher term in its last log entry; or
    2. The Candidate has the same last term and a log that is at least as long;
  • This rule ensures the elected Leader’s log is at least as up-to-date as the logs of more than half of the servers (see the sketch after this list);
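
A sketch of that “at least as up-to-date” comparison; the function name and parameters are illustrative:

```go
package raft

// candidateUpToDate reports whether the Candidate's last log entry is at
// least as up-to-date as ours: a higher last term wins, and on equal terms
// the longer (or equal-length) log wins.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```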

Persistence

What would we like to happen after a server crashes?

  • The Raft system should continue handling client requests with one missing server.
  • The failed server must be repaired soon to avoid dipping below a majority.

The strategy Raft uses is to reboot the crashed server.

A Raft server persists the following values so it can restart (a minimal encoding sketch follows the list):

  1. log[]: If a server was in the Leader’s majority for committing an entry, it must remember the entry despite reboots, so any future Leader is guaranteed to see the committed log entry. In other words, from the persisted logs of all servers we can accurately reconstruct the responses the “distributed system” has already given to the outside world;
  2. votedFor: To prevent a server from voting for multiple Candidates in one term;
  3. currentTerm: To ensure terms only increase;
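
A sketch of saving that state with the standard library’s encoding/gob; the 6.824 labs supply their own persister helpers, so this is only an illustration of what gets written:

```go
package raft

import (
	"bytes"
	"encoding/gob"
)

// encodePersistentState serializes exactly the three fields Raft must keep
// across reboots. Concrete Command types stored in LogEntry must be
// registered with gob.Register before encoding.
func encodePersistentState(currentTerm int, votedFor int, log []LogEntry) []byte {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	enc.Encode(currentTerm) // terms only ever increase across reboots
	enc.Encode(votedFor)    // at most one vote per term, even after a crash
	enc.Encode(log)         // committed entries must survive reboots
	return buf.Bytes()
}
```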

How does the service recover its state after a crash+reboot?

  • Simple approach: start with an empty state and replay the entire log;
  • Faster: use a Raft snapshot and replay only the tail of the log (see the sketch after this list);
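
A sketch of the replay step, assuming a hypothetical applyToStateMachine callback supplied by the service:

```go
package raft

// replay restores state after a reboot: the snapshot already covers entries
// up to snapshotLastIndex, so only the tail of the log is re-applied.
func replay(snapshotLastIndex int, log []LogEntry, applyToStateMachine func(cmd interface{})) {
	for i := snapshotLastIndex + 1; i < len(log); i++ {
		applyToStateMachine(log[i].Command)
	}
}
```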

Log Compaction && Snapshots

Discard log:

  • Problem: the log keeps growing, so it takes a long time to replay on reboot or to send to a newly added server;

  • Solution: the service periodically creates a persistent “snapshot” and discards the log prefix the snapshot covers (see the sketch after this list).

    Notice: entries that are not yet executed or not yet committed cannot be discarded.
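
A sketch of dropping the covered prefix; it assumes log indices have not been offset yet, and it keeps a placeholder recording the snapshot’s last term so later AppendEntries consistency checks still have a prevLogTerm to compare against:

```go
package raft

// compactLog discards everything up to and including lastIncluded, the index
// of the last entry reflected in the snapshot.
func compactLog(log []LogEntry, lastIncluded int) []LogEntry {
	newLog := []LogEntry{{Term: log[lastIncluded].Term}} // placeholder at slot 0
	return append(newLog, log[lastIncluded+1:]...)
}
```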

What if a Follower’s log ends before the Leader’s log starts?

  • The Leader repairs that Follower with an InstallSnapshot RPC instead of AppendEntries RPCs (message format sketched below);
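
A sketch of the InstallSnapshot message; the field names follow Figure 13 of the Raft paper (Offset and Done support sending the snapshot in chunks, though many implementations send it in one piece):

```go
package raft

// InstallSnapshotArgs follows Figure 13 of the Raft paper.
type InstallSnapshotArgs struct {
	Term              int    // Leader's term
	LeaderId          int
	LastIncludedIndex int    // the snapshot replaces all entries up to this index
	LastIncludedTerm  int    // term of the entry at LastIncludedIndex
	Offset            int    // byte offset of this chunk within the snapshot
	Data              []byte // raw snapshot bytes
	Done              bool   // true if this is the last chunk
}

type InstallSnapshotReply struct {
	Term int // currentTerm, so the Leader can step down if it is stale
}
```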

Linearizability

Linearizability (or “strong” consistency) is the most common and intuitive definition; it formalizes the behavior expected of a single server.

Definition: An execution history is linearizable if:

  • One can find a total order of all operations that matches real time (if operation A completes before operation B starts, A must come before B in the order);
  • And in which each read sees the value written by the closest preceding write in that order;

Example see: http://nil.csail.mit.edu/6.824/2020/notes/l-raft2.txt

Duplicate RPC Detection