Primary/Backup Replication
Two main replication approaches:
- State Transfer: the primary replica executes and sends its new state to the backup machines;
- Replicated State Machine: the primary just passes the raw external events to the backups. This is the approach mostly used in recent industry systems and papers;
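To make the difference concrete, here is a minimal sketch of both approaches on a toy service. The `Counter` class, the `(key, delta)` event format, and both function names are assumptions made up for illustration; they are not part of VM-FT.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    """A tiny deterministic state machine used as the replicated service."""
    values: dict = field(default_factory=dict)

    def apply(self, event):
        # An event is a hypothetical (key, delta) pair.
        key, delta = event
        self.values[key] = self.values.get(key, 0) + delta


def state_transfer(primary, backup, event):
    """The primary executes the event, then ships its whole new state."""
    primary.apply(event)
    backup.values = dict(primary.values)  # copy of the full state


def replicated_state_machine(primary, backup, event):
    """The primary forwards the raw event; both replicas execute it."""
    primary.apply(event)
    backup.apply(event)


events = [("x", 1), ("y", 5), ("x", 2)]

p1, b1 = Counter(), Counter()
for e in events:
    state_transfer(p1, b1, e)

p2, b2 = Counter(), Counter()
for e in events:
    replicated_state_machine(p2, b2, e)
```

Both pairs end up identical, but the second one only sends the small event over the wire, which is why it works only if `apply` is deterministic.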
Overview:
- VM-FT consists of two machines: primary and backup. The primary deals with all external events and replicates them to the backup through a “logging channel”;
- VM-FT emulates a local disk interface through two remote disk servers.
Log Entry
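Events alone cannot determine every situation; FT must handle the following sources of divergence:
- Differences in the instructions themselves: most instructions execute identically on both machines;
- Differences in signals from outside the machine: input from the external world, such as network packets, DMA data, and OS interrupts;
- Not every instruction is deterministic (a pure function whose output is uniquely determined by its input). For example: reading the current time, or a random number generator (in a sense the same category as the former);
- Concurrency and multi-core: parallelism and multi-core races;
To keep the state of the primary and backup exactly identical, we must handle the cases listed above, passing this information along when the two communicate so that both execute exactly the same code.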
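One way to picture the handling of non-deterministic instructions is record and replay: the primary performs the real operation and logs the result, and the backup consumes the logged result instead of asking its own hardware. The sketch below assumes a plain Python list as the log and hypothetical `run_primary` / `run_backup` functions; it only illustrates the idea.

```python
import random
import time


def run_primary(log):
    """Execute for real and record every non-deterministic result."""
    now = time.time()
    log.append(("time", now))
    rnd = random.random()
    log.append(("random", rnd))
    return now, rnd


def run_backup(log):
    """Never touch the local clock or RNG; replay the primary's values."""
    replay = iter(log)
    kind, now = next(replay)
    assert kind == "time"
    kind, rnd = next(replay)
    assert kind == "random"
    return now, rnd


log = []
primary_view = run_primary(log)
backup_view = run_backup(log)
```

After the run, `primary_view` and `backup_view` hold the same values even though the backup never read its own clock.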
So the log entry that carries messages between Primary and Backup should contain the following:
- Instruction number, interrupt type, interrupt data;
- Example:
  - While executing instruction number 120120 since boot;
  - The program gets a network packet (interrupt type);
  - Carrying a TCP handshake ACK (interrupt data);
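The three fields can be sketched as a small record with a wire encoding. The field names, the `InterruptType` codes, and the 13-byte header layout are assumptions chosen for this example, not the real VM-FT log format.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum

# Assumed header: instruction number (8 bytes), type code (1 byte), data length (4 bytes)
HEADER = struct.Struct("!QBI")


class InterruptType(IntEnum):
    TIMER = 0
    NETWORK_PACKET = 1


@dataclass(frozen=True)
class LogEntry:
    instruction_number: int        # instructions executed since boot
    interrupt_type: InterruptType
    interrupt_data: bytes          # payload, empty for a timer interrupt

    def encode(self):
        header = HEADER.pack(
            self.instruction_number,
            int(self.interrupt_type),
            len(self.interrupt_data),
        )
        return header + self.interrupt_data

    @staticmethod
    def decode(buf):
        number, type_code, length = HEADER.unpack_from(buf)
        data = bytes(buf[HEADER.size:HEADER.size + length])
        return LogEntry(number, InterruptType(type_code), data)


entry = LogEntry(120120, InterruptType.NETWORK_PACKET, b"TCP handshake ACK")
wire = entry.encode()
decoded = LogEntry.decode(wire)
```

Here `decoded` equals `entry`, so the backup can reconstruct exactly what the primary observed.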
Timer Interrupt
How does FT handle timer interrupts?
Goal: make sure the primary and backup see the interrupt at exactly the same point in the instruction stream;
Primary should do:
- FT fields the timer interrupt;
- FT reads instruction number X from the CPU;
- FT sends instruction number X on the logging channel to Backup;
- FT delivers the interrupt to Primary and resumes execution;
Backup should do:
- Backup ignores its own timer hardware;
- Backup sees instruction number X on the logging channel before it executes that instruction;
- FT tells the CPU to “interrupt me at instruction X”;
- FT mimics a timer interrupt to Backup;
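The steps on both sides can be sketched with a toy guest that counts instructions. `ToyVM`, `run_primary`, `run_backup`, and the `deque` used as the logging channel are hypothetical names for illustration. In this sketch the whole log is available before the backup starts, which stands in for "Backup sees X before it executes instruction X".

```python
from collections import deque


class ToyVM:
    """Hypothetical stand-in for a guest: each step executes one instruction."""

    def __init__(self):
        self.instruction_number = 0
        self.trace = []  # everything the guest observed, in order

    def step(self):
        self.instruction_number += 1
        self.trace.append(("instr", self.instruction_number))

    def deliver_interrupt(self, kind):
        self.trace.append((kind, self.instruction_number))


def run_primary(vm, total_steps, timer_fires_at, channel):
    # timer_fires_at: instruction numbers at which the real timer happens to fire
    for _ in range(total_steps):
        vm.step()
        if vm.instruction_number in timer_fires_at:
            x = vm.instruction_number       # FT reads instruction number X
            channel.append(("timer", x))    # FT sends X on the logging channel
            vm.deliver_interrupt("timer")   # FT delivers the interrupt, guest resumes


def run_backup(vm, total_steps, channel):
    # The backup has no timer of its own; it only follows the log.
    for _ in range(total_steps):
        vm.step()
        # "interrupt me at instruction X"
        while channel and channel[0][1] == vm.instruction_number:
            kind, _ = channel.popleft()
            vm.deliver_interrupt(kind)      # FT mimics the timer interrupt


channel = deque()
primary, backup = ToyVM(), ToyVM()
run_primary(primary, 10, {3, 7}, channel)
run_backup(backup, 10, channel)
```

Both guests end with the same `trace`, interrupts included, although only the primary had a "real" timer.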
Network Interrupt
How does FT handle the arrival of a network packet?
Goal: exactly the same as for the timer interrupt, plus delivering the packet data along with it.
Primary should do:
- Boosting: FT tells the NIC (Network Interface Controller) to copy packet data into FT’s private “bounce buffer”;
- At some point, the NIC does DMA and then interrupts:
  - FT pauses the primary;
  - FT copies the “bounce buffer” into Primary’s memory;
  - FT simulates a NIC interrupt in Primary;
  - FT sends the “packet data” and “instruction number” to Backup through the log channel;
Backup should do:
- Backup receives the instruction number from the log stream;
- FT tells the CPU to interrupt at instruction X;
- FT copies the data into backup memory and simulates a NIC interrupt in Backup;
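A minimal sketch of the packet path, assuming a toy `Guest` whose memory is a `bytearray` and a list used as the log channel. All names are hypothetical. The point is that packet bytes reach guest memory only at a known instruction number, on both sides.

```python
class Guest:
    def __init__(self, mem_size=16):
        self.instruction_number = 0
        self.memory = bytearray(mem_size)
        self.snapshots = []  # (instruction number, memory contents) at each interrupt

    def step(self):
        self.instruction_number += 1

    def nic_interrupt(self):
        self.snapshots.append((self.instruction_number, bytes(self.memory)))


def primary_receive(guest, bounce_buffer, log):
    """Runs while FT has the primary guest paused."""
    data = bytes(bounce_buffer)
    guest.memory[:len(data)] = data                 # copy bounce buffer into guest memory
    guest.nic_interrupt()                           # simulate the NIC interrupt
    log.append((guest.instruction_number, data))    # send instruction number + packet data


def backup_replay(guest, log, total_steps):
    pending = list(log)
    for _ in range(total_steps):
        guest.step()
        while pending and pending[0][0] == guest.instruction_number:
            _, data = pending.pop(0)
            guest.memory[:len(data)] = data         # same bytes, same instruction
            guest.nic_interrupt()


log = []
primary = Guest()
for _ in range(5):
    primary.step()
    if primary.instruction_number == 2:
        bounce_buffer = bytearray(b"ping")  # NIC DMA lands here, not in guest memory
        primary_receive(primary, bounce_buffer, log)

backup = Guest()
backup_replay(backup, log, 5)
```

The `snapshots` lists of the two guests match, showing that the data appeared in memory at the same point of execution.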
Bounce Buffer
What is the bounce buffer?
- The bounce buffer is a memory area private to FT that stores network packet data;
Why a bounce buffer?
- We want the data to appear in memory at exactly the same point in execution of the Primary and Backup;
More Rules
Output Rule
Suppose we encounter the following situation:
- Primary crashes just after sending the reply to the client;
- Backup doesn’t receive any event from Primary because it has crashed;
The output rule was introduced to deal with this:
- Primary should respond to the client only after receiving an acknowledgement (ACK) from Backup;
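The output rule can be sketched as "hold the reply until the log entry is acknowledged". The classes below, the synchronous ACK modeled as a return value, and the `link_up` flag are assumptions for illustration only.

```python
class Backup:
    def __init__(self):
        self.log = []

    def receive(self, entry):
        self.log.append(entry)
        return len(self.log)  # ACK: number of entries received so far


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.unsent_entries = []  # log entries not yet delivered to the backup
        self.held_replies = []    # outputs waiting for an ACK
        self.client_inbox = []    # what the client has actually seen

    def handle_request(self, request, link_up=True):
        self.unsent_entries.append(request)
        self.held_replies.append(request.upper())  # hypothetical "service"
        if link_up:
            self.flush()

    def flush(self):
        for entry in self.unsent_entries:
            self.backup.receive(entry)  # returning means the backup ACKed
        self.unsent_entries.clear()
        # Output rule: release replies only after the ACKs arrived
        self.client_inbox.extend(self.held_replies)
        self.held_replies.clear()


backup = Backup()
primary = Primary(backup)
primary.handle_request("get a")
primary.handle_request("get b", link_up=False)  # logging channel is down
held_while_down = list(primary.held_replies)
seen_while_down = list(primary.client_inbox)
primary.flush()                                 # channel is back
```

While the channel is down the second reply stays in `held_replies`, so the client never sees output that the backup does not know about.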
Split Brain
Suppose we encounter the following situation:
- The network between Primary and Backup has been cut;
- So both machines think the other one is dead; each thinks it is the Primary and stops sending logging events;
This is a common problem called “split brain”. FT uses a central server that supports an atomic test-and-set operation; the machine that gets the flag becomes the Primary.
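A minimal sketch of the arbitration, assuming an in-process `ArbitrationServer` guarded by a lock in place of the real central server. The class and function names are hypothetical.

```python
import threading


class ArbitrationServer:
    def __init__(self):
        self._flag = False
        self._lock = threading.Lock()

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._lock:
            old = self._flag
            self._flag = True
            return old


def try_go_live(server, name, winners):
    # A previous value of False means this machine got the flag.
    if not server.test_and_set():
        winners.append(name)


server = ArbitrationServer()
winners = []
threads = [
    threading.Thread(target=try_go_live, args=(server, name, winners))
    for name in ("primary", "backup")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever thread runs first wins, and `winners` always ends up with exactly one name, so only one machine goes live as Primary.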
Summary
When might FT be attractive?
- Critical but low-intensity services, e.g. a name server;
- Services whose software is not convenient to modify;
What about replication for high-throughput services?
- Recommendation: application-level replicated state machines;
- Example: a database state machine; the database supports only a limited set of commands, which makes the messages easier to transfer;
- Result: less fine-grained synchronization and less overhead;