VMware FT(Fault-Tolerant)

Primary/Backup Replication

Two main replication approaches:

  1. State Transfer: Primary replica executes and sends new state to backups machine;
  2. Replicated State Machine: Primary just pass the raw external event to backups. Mostly used by recent industry and papers;

Overview:

  • VM-FT consist of two machine: primary and backup. Primary deals with all external events and replicates it to backup through “logging channel”;
  • VM-FT emulates a local disk interface through two remote disk server.

Log Entry

events can’t determine all situation, FT must handle the following divergence:

  1. 指令本身的差异:Most instructions execute identically;
  2. 机器所处外部的信号差异:Input from external world: network packet, DMA data, OS interrupt, etc;
  3. 并不是所有指令都是状态指令(唯一输入唯一输出的纯函数)。比如:读取当前时间、随机数发生器(某种意义上说与前者是一类);
  4. 并发与多核:Parallelism and multi-core races;

为了使得 pirmarybackup 的状态完全一样,我们就必须要处理上面列举的一些异常情况,在它们通信时将这些信息传递出去使得它们执行的代码完全一样。

So the log entry who transfers message between Primary and Backup should contain these message below:

  • Instruction number, interrupt type, interrupt data;
  • Example:
    • When executing the 120120(instruction number) instruction since boot;
    • Program get network packet (interrupt type);
    • Carrying a tcp hand shake ACK (interrupt data);

Timer Interrupt

How does FT handle timer interrupts?

Goal: Make sure primary and backup should see interrupt at totally same situation;

Primary should do:

  1. FT fields the timer interrupt;
  2. FT reads instruction number X from CPU;
  3. FT send instruction number X on the logging channel to Backup;
  4. FT delivers interrupt to Primary and resume executing;

Backup should do:

  1. Ignores its own timer hardware;
  2. Backup see instruction number X from logging channel before the exact instruction executed;
  3. FT tells CPU to “interrupt me at instruction X”;
  4. FT mimics a timer interrupt to Backup;

Network Interrupt

How does FT handle arrival of network packet?

Goal: Exactly same as timer interrupt, with data designating.

Primary should do:

  • Boosting: Tells NIC (Network Interface Controller) to copy packet data into FT’s private “bounce buffer”;
  • At some point, NIC does DMA then interrupt:
    1. FT pause the primary;
    2. FT copies the “bound buffer” into Primary’s memory;
    3. FT simulates a NIC interrupt in Primary;
    4. FT send “packet data” and “instruction number” to Backup through log channel;

Backup should do:

  1. Backup received instruction number from log stream;
  2. FT tells CPU to interrupt at instruction X;
  3. FT copies the data to backup memory and similates NIC interrupt in Backup;

Bounce Buffer

What bounce buffer?

  • Bounce buffer is a FT’s memory area that store network packet data;

Why bounce buffer?

  • We want the data to appear in memory at exactly the same point in execution of the Primary and Backup;

More Rule

Output Rule

Suppose we encountered the following situation:

  • Primary crashes jsut after sending the reply to client;
  • Backup doesn’t receive any event from Primary because it has crashed;

Output rule was brought up to deal with this:

  • Primary should repsonse to client after receiving ACKnowledgement from Backup;

Split Brain

Suppose we encounter the following situation:

  • Network between Primary and Backup has been cut over;
  • So both machine think the other one is dead, and think it’s the Primary and stop sending logging event;

This is a common problem called “split brain”. FT creat a center server support atomic test-and-set instruction, machine who get flag can become Primary.

Summary

When might FT be attractive?

  • Critical but low-intensity services: name server;

  • Services whose software is not convenient to modify;

What about replication for high-throughput services?

  • Recommend: Applicative-level replicated state machines;
    • Example: Database state machine, database only support a limit set of command which is easier to transfer message;
  • Result: less fine-grained synchronization (更细粒度的同步), less overhead;