Primary/Backup Replication
Two main replication approaches:
- State Transfer: Primary replica executes and sends
new
state to backups machine; - Replicated State Machine: Primary just pass the raw external event to backups. Mostly used by recent industry and papers;
Overview:
- VM-FT consist of two machine:
primary
andbackup
. Primary deals with all external events and replicates it to backup through “logging channel”; - VM-FT emulates a local disk interface through two remote disk server.
Log Entry
events
can’t determine all situation, FT must handle the following divergence:
- 指令本身的差异:Most instructions execute identically;
- 机器所处外部的信号差异:Input from external world: network packet, DMA data, OS interrupt, etc;
- 并不是所有指令都是状态指令(唯一输入唯一输出的纯函数)。比如:读取当前时间、随机数发生器(某种意义上说与前者是一类);
- 并发与多核:Parallelism and multi-core races;
为了使得 pirmary
和 backup
的状态完全一样,我们就必须要处理上面列举的一些异常情况,在它们通信时将这些信息传递出去使得它们执行的代码完全一样。
So the log entry
who transfers message between Primary
and Backup
should contain these message below:
- Instruction number, interrupt type, interrupt data;
- Example:
- When executing the 120120(instruction number) instruction since boot;
- Program get network packet (interrupt type);
- Carrying a tcp hand shake ACK (interrupt data);
Timer Interrupt
How does FT handle timer interrupts?
Goal: Make sure primary and backup should see interrupt at totally same situation;
Primary
should do:
- FT fields the timer interrupt;
- FT reads instruction number X from CPU;
- FT send instruction number X on the logging channel to
Backup
; - FT delivers interrupt to
Primary
and resume executing;
Backup
should do:
- Ignores its own timer hardware;
Backup
see instruction number X from logging channel before the exact instruction executed;- FT tells CPU to “interrupt me at instruction X”;
- FT mimics a timer interrupt to
Backup
;
Network Interrupt
How does FT handle arrival of network packet?
Goal: Exactly same as timer interrupt, with data designating.
Primary
should do:
- Boosting: Tells NIC (Network Interface Controller) to copy packet data into FT’s private “bounce buffer”;
- At some point, NIC does DMA then interrupt:
- FT pause the primary;
- FT copies the “bound buffer” into
Primary
’s memory; - FT simulates a NIC interrupt in
Primary
; - FT send “packet data” and “instruction number” to
Backup
through log channel;
Backup
should do:
Backup
received instruction number from log stream;- FT tells CPU to interrupt at instruction X;
- FT copies the data to backup memory and similates NIC interrupt in
Backup
;
Bounce Buffer
What bounce buffer?
- Bounce buffer is a FT’s memory area that store network packet data;
Why bounce buffer?
- We want the data to appear in memory at exactly the same point in execution of the
Primary
andBackup
;
More Rule
Output Rule
Suppose we encountered the following situation:
Primary
crashes jsut after sending the reply to client;Backup
doesn’t receive any event fromPrimary
because it has crashed;
Output rule was brought up to deal with this:
Primary
should repsonse to client after receivingACKnowledgement
fromBackup
;
Split Brain
Suppose we encounter the following situation:
- Network between
Primary
andBackup
has been cut over; - So both machine think the other one is dead, and think it’s the
Primary
and stop sending logging event;
This is a common problem called “split brain”. FT creat a center server support atomic test-and-set
instruction, machine who get flag can become Primary
.
Summary
When might FT be attractive?
Critical but low-intensity services: name server;
Services whose software is not convenient to modify;
What about replication for high-throughput services?
- Recommend: Applicative-level replicated state machines;
- Example: Database state machine, database only support a limit set of command which is easier to transfer message;
- Result: less fine-grained synchronization (更细粒度的同步), less overhead;