第一部分：数据结构与对象

因为阅读的书已经较为过时，新的数据结构没有完全讲完，比如：
zipmap、quicklist、stream

字符串 (Simple Dynamic String)

Redis 没有使用 C 语言中的 \0 结尾的方式表示一个字符串，而是自己构建了一个结构 SDS 作为基本字符串类型。这一结构的定义和实现分别在源码的 src/sds.h、src/sds.c 这两个文件中。

结构

以最长 256 字节长度串的结构 sdshdr8 为例：

1
2
3
4
5
6
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

由上面的源码可以看出 SDS 结构的核心是：

len：已经使用的字符串长度；
alloc：总共分配的缓冲区长度；
SDS 为了兼容 C 语言的空字符串结尾，会额外分配一个字节的大小，如果假设字符串可以增加的大小为 rmd，那么有公式：alloc = rmd + len + 1；
buf：字符串字面量；

像大多数高级语言一样，使用简单动态字符串这种封装方式有以下好处：

获取字符串长度，时间复杂度 O(1)；
二进制安全：杜绝缓冲区溢出、可以存储二进制形式的字符串；
减少修改字符串所需要的内存重新分配次数；
兼容部分 C 字符串函数；

方法

sds 的创建核心方法可以看到源码中的 _sdsnewlen，它的函数签名如下：

1
sds _sdsnewlen(const void *init, size_t initlen, int trymalloc)

在这个函数中可以看到 sds 的内存排列方式：

1
2
3
    s = (char*)sh+hdrlen;
    fp = ((unsigned char*)s)-1;
    usable = usable-hdrlen-1;

其中 s 就是 sds 结构中的 buf，并最终赋值给了 sds 结构指针；

链表 (Linked List)

普通双端链表

Redis 链表的底层是一个无环双向链表，结构可以在 src/adlink.h、src/adlink.c 这两个文件中看到。adlink 是 A Generic Doubly Linked List 的缩写。

链表的节点使用结构 listNode：

1
2
3
4
5
typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

链表本身使用结构 list：

1
2
3
4
5
6
7
8
typedef struct list {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *ptr);
    void (*free)(void *ptr);
    int (*match)(void *ptr, void *key);
    unsigned long len;
} list;

除了头节点 head、尾节点 tail、以及链表长度 len，结构中另外三个是三个函数：

dup 函数用于复制节点，free 函数用于释放节点、match 函数用于比较节点；
因为 listNode 的 value 可以指向任意对象，因此需要为 list 结构的实例设置三个用于操作节点的函数，这其实是一种多态的体现；

使用快速链表

Redis3.2，在 quicklist implementation 这个 PR 提交之后，Redis 实现链表这一数据对象的默认方式由 ziplist/linkedlist 变成了 quicklist。这种实现方式相对于之前的实现有以下的好处：

对于 linkedlist 实现的链表，容易形成很多内存碎片，查找的时间复杂度可以认为是 O(n)；
而对于 ziplist 实现的链表，每次执行插入删除操作时都会进行内存的重新分配；

快速链表 quicklist 的结构简单地说就是一个 ziplist 的 linkedlist。

快速链表结构

quicklist 本身的结构：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
/* quicklist is a 40 byte struct (on 64-bit systems) describing a quicklist.
 * 'count' is the number of total entries.
 * 'len' is the number of quicklist nodes.
 * 'compress' is: 0 if compression disabled, otherwise it's the number
 *                of quicklistNodes to leave uncompressed at ends of quicklist.
 * 'fill' is the user-requested (or default) fill factor.
 * 'bookmakrs are an optional feature that is used by realloc this struct,
 *      so that they don't consume memory when not used. */
typedef struct quicklist {
    quicklistNode *head;
    quicklistNode *tail;
    unsigned long count;        /* total count of all entries in all ziplists */
    unsigned long len;          /* number of quicklistNodes */
    int fill : QL_FILL_BITS;              /* fill factor for individual nodes */
    unsigned int compress : QL_COMP_BITS; /* depth of end nodes not to compress;0=off */
    unsigned int bookmark_count: QL_BM_BITS;
    quicklistBookmark bookmarks[];
} quicklist;

其中，除了容易理解的几个参数之外：

fill 的默认值是 16，被注释为 fill factor，这是什么意思呢？可以在 .c 文件的 _quicklistNodeAllowInsert 函数最后三行得知：
1 2 3 4
if ((int)node->count < fill) return 1; else return 0;
这也就是说，单个节点的长度不超过 fill 才可以继续执行插入，所以 fill 表征的是这些 ziplist 的大小；

compress 的默认值也为 16，注释里说它是没有被压缩的节点数量。具体可以在 __quicklistCompress 这个函数中看到 compress 这个变量的具体使用方式：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
		quicklistNode *forward = quicklist->head;
    quicklistNode *reverse = quicklist->tail;
    int depth = 0;
    int in_depth = 0;
    while (depth++ < quicklist->compress) {
        quicklistDecompressNode(forward);
        quicklistDecompressNode(reverse);

        if (forward == node || reverse == node)
            in_depth = 1;

        /* We passed into compress depth of opposite side of the quicklist
         * so there's no need to compress anything and we can exit. */
        if (forward == reverse || forward->next == reverse)
            return;

        forward = forward->next;
        reverse = reverse->prev;
    }

举个例子说如果一个 quicklist 的长度为 20，compress 的值为 2，那么第 0、1、18、19 这四个 quicklistNode 则是没有压缩的；

bookmarks 与 bookmark_count 是一个为了加速访问的，面向 Redis 使用者的工具，它提供了以 O(1) 到达某个具体位置的能力。

Bookmark 的使用可以查看它结构定义位置的注释：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
/* Bookmarks are padded with realloc at the end of of the quicklist struct.
 * They should only be used for very big lists if thousands of nodes were the
 * excess memory usage is negligible, and there's a real need to iterate on them
 * in portions.
 * When not used, they don't add any memory overhead, but when used and then
 * deleted, some overhead remains (to avoid resonance).
 * The number of bookmarks used should be kept to minimum since it also adds
 * overhead on node deletion (searching for a bookmark to update). */
typedef struct quicklistBookmark {
    quicklistNode *node;
    char *name;
} quicklistBookmark;

只有在 quicklist 特别大，而又有强烈的局部访问诉求时，才建议使用它。否则它不会增加 quicklist 的内存开销。

quicklistNode 结构的定义如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
/* quicklistNode is a 32 byte struct describing a ziplist for a quicklist.
 * We use bit fields keep the quicklistNode at 32 bytes.
 * count: 16 bits, max 65536 (max zl bytes is 65k, so max count actually < 32k).
 * encoding: 2 bits, RAW=1, LZF=2.
 * container: 2 bits, NONE=1, ZIPLIST=2.
 * recompress: 1 bit, bool, true if node is temporary decompressed for usage.
 * attempted_compress: 1 bit, boolean, used for verifying during testing.
 * extra: 10 bits, free for future use; pads out the remainder of 32 bits */
typedef struct quicklistNode {
    struct quicklistNode *prev;
    struct quicklistNode *next;
    unsigned char *zl;
    unsigned int sz;             /* ziplist size in bytes */
    unsigned int count : 16;     /* count of items in ziplist */
    unsigned int encoding : 2;   /* RAW==1 or LZF==2 */
    unsigned int container : 2;  /* NONE==1 or ZIPLIST==2 */
    unsigned int recompress : 1; /* was this node previous compressed? */
    unsigned int attempted_compress : 1; /* node can't compress; too small */
    unsigned int extra : 10; /* more bits to steal for future usage */
} quicklistNode;

注释中可以看出此处用的压缩方式是 lzf，它的压缩解压缩函数在 lzf_c.c - lzf_compress/lzf_d.c - lzf_decompress 这两个函数中。

字典 (Hash Table)

字典的结构与核心方法可以在 src/dict.c 与 src/dict.h 这两个文件中看到。

结构

实现字典的结构中过程中，涉及了三个核心结构。

字典本身使用 dict 这个结构实现的：

1
2
3
4
5
6
7
typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    int16_t pauserehash; /* If >0 rehashing is paused (<0 indicates coding error) */
} dict;

跟 list 类似，其中 dictType 是一个为了实现多态的函数指针封装；
dictht 结构才是实现字典的核心结构；
dict 持有两个 dictht 并且定义了 rehashidx/rehashidx 这些成员变量，是为了 rehash 的性能与可用性的考虑；

哈希表使用 dictht 实现：

1
2
3
4
5
6
typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

其中 dictEntry 是以链式存储的哈希表节点：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;
} dictEntry;

通过 next 字段可以看出 dictEntry 的本质是一个链表的节点，哈希表通过这种方式解决哈希冲突；

哈希算法

哈希算法是哈希表实现的重点，在代码的注释中可以看到，默认的哈希算法是 siphash，该算法在 siphash.c 这个文件中实现。

Rehash 大小

Redis 的哈希表设计巧妙之处正在于 Rehash 的方法实现。

调整大小的核心方法是：

1
int _dictExpand(dict *d, unsigned long size, int* malloc_failed)

查看这个方法的实现，可以知道函数通过 _dictNextPower 来计算预期的大小。这个方法中会根据 size 计算实际预期的大小，这个计算方法是：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
static unsigned long _dictNextPower(unsigned long size)
{
    unsigned long i = DICT_HT_INITIAL_SIZE;	// 4
    if (size >= LONG_MAX) return LONG_MAX + 1LU;
    while(1) {
        if (i >= size)
            return i;
        i *= 2;
    }
}

查看这个方法的调用，可以得到在什么条件下会触发 Rehash：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
/* Expand the hash table if needed */
static int _dictExpandIfNeeded(dict *d) {
  	/// ...
		/* If we reached the 1:1 ratio, and we are allowed to resize the hash
     * table (global setting) or we should avoid it but the ratio between
     * elements/buckets is over the "safe" threshold, we resize doubling
     * the number of buckets. */
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio) &&
        dictTypeExpandAllowed(d))
    {
        return dictExpand(d, d->ht[0].used + 1);
    }
  	/// ...
}

而 _dictExpandIfNeeded 则会在每次使用 _dictKeyIndex 插入新的 key 时调用。

Rehash 流程

如果使用普通的 rehash 方案全局调整并且复制，在数据量较大的情况下，会导致服务器短暂的宕机。

因此 Redis 设计了一个渐进式的 Rehash 方式，渐进式地多次完成，而不是集中地一次完成。具体的，在 Rehash 的过程中，dict 会存储两份 dictht 数据，所有的增删改查操作都会在这两个表中进行。

Rehash 的原子步骤是源码中的 dictRehash 这个方法，它的签名和注释如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
/* Performs N steps of incremental rehashing. Returns 1 if there are still
 * keys to move from the old to the new hash table, otherwise 0 is returned.
 *
 * Note that a rehashing step consists in moving a bucket (that may have more
 * than one key as we use chaining) from the old to the new hash table, however
 * since part of the hash table may be composed of empty spaces, it is not
 * guaranteed that this function will rehash even a single bucket, since it
 * will visit at max N*10 empty buckets in total, otherwise the amount of
 * work it does would be unbound and the function may block for a long time. */
int dictRehash(dict *d, int n)

在这个方法中，以 rehashidx 为标记，把之后连续的 n 个记录从 ht[0] 迁移到 ht[1]。每当一个元素从 ht[0] 迁移到 ht[1]，函数会更改 used 字段同时维持 size 字段不变，ht[0] 中对应下标位置的指针将会被指向为 NULL。

查看这个函数的调用栈可以分析 Redis 具体是如何执行 Rehash 操作的。

dictRehashMilliseconds：函数接收一个以 ms 为单位的时间参数，函数内部每次 “迁移 100 个元素”为原子操作，进行执行时间不超过这个上限的 rehash。
在 server.c 文件中通过调用这个函数进行 rehash 操作，每 100ms 会占用小于 1ms 时间进行 rehash。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
/* Rehash in ms+"delta" milliseconds. The value of "delta" is larger 
 * than 0, and is smaller than 1 in most cases. The exact upper bound 
 * depends on the running time of dictRehash(d,100).*/
int dictRehashMilliseconds(dict *d, int ms) {
    if (d->pauserehash > 0) return 0;
    long long start = timeInMilliseconds();
    int rehashes = 0;
    while(dictRehash(d,100)) {
        rehashes += 100;
        if (timeInMilliseconds()-start > ms) break;
    }
    return rehashes;
}

_dictRehashStep：对单步迁移操作的封装。
该函数在添加删除查找等方法被执行时会用到，我感觉这么做的好处是可以将计算压力分摊到每次访问请求中，而基本不会影响每次访问的查询速度。同时间接地实现了“越被频繁使用的字典，计算优先级越高”；

PS：美团技术团队针对 Redis-Rehash 这一方向进行了优化：https://tech.meituan.com/2018/07/27/redis-rehash-practice-optimization.html

`zipmap`

zipmap 名字叫 zip，实际上并没有进行压缩操作，它将键值对连续存储，省去了许多管理 map 的指针结构，它的大致结构可以在 zipmapNew 这个函数中看到：

1
2
3
4
5
6
7
unsigned char *zipmapNew(void) {
    unsigned char *zm = zmalloc(2);

    zm[0] = 0; /* Length */
    zm[1] = ZIPMAP_END;
    return zm;
}

跳跃表 (Skip List)

跳跃表最初在论文《Skip Lists: A Probabilistic Alternative to Balanced Trees》中提出，在 Redis 中的使用是这个数据结构的高光时刻。

Redis 中的实现

跳跃表定义在 server.h 这个文件中，它是有序集合的底层实现之一。

跳跃表本身的实现结构名称为 zskiplist：

1
2
3
4
5
typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;

其中 zskiplistNode 主要由字符值、分值、多层跳跃指针三个部分构成：

1
2
3
4
5
6
7
8
9
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;

完美的跳跃表可以实现 O(log n) 的查找时间复杂度。

参数 p

跳跃表的实现有一个关键参数 p，在 Redis 中它被定义为默认值 1/4：

1
#define ZSKIPLIST_P 0.25      /* Skiplist P = 1/4 */

它有以下的含义：

在实现跳跃表的程序中，它表征新插入的元素是否要新加一层的概率。这也就是说：
- 一个新插入的元素有一层跳跃指针概率为 p，两层的概率为 $p^2$，后是 $p^3$；
第 x+1 层指针数量的数学期望，与第 x 层指针数量的数学期望，的比；

查找时间复杂度

设函数 f(x) 为：长度为 n 的跳跃表在查找第 x 项时需要经过的总路径数量。

那么容易得到方程：

$\displaystyle f(x) = f(px) + \frac{1}{2p}$

其中，因为这个差值是一个介于 0 与 $\displaystyle \frac{1}{p}$ 之间均匀分布的变量，所以期望是 $\displaystyle \frac{1}{2p}$。

因为 p<1，所以：

$\displaystyle \lim_{i \rightarrow +\infin} f(p^{i}x) = f(0) = 0$

其中 $f(0) = 0$ ，表示如果要查找的元素位于第一个元素，则不需要进行任何的查找操作。

因此对上面的方程进行迭代，在长度为 n 的跳跃表中，最高层数为 $log_{\frac{1}{p}} n$，所以可以得到这个函数方程的解：

$\displaystyle f(x) = 0 + \frac{1}{2p} * log_{\frac{1}{p}}n = - \frac{ln(n)}{2p * ln(p)}$

对分母求导可以得到，考虑时间复杂度时的最合适的 p 是 $\displaystyle \frac{1}{e}$。

期望空间复杂度

根据 zskiplistNode 的结构，我们基本可以认为一个 node 的大小跟一个 level 的大小是相同的，我们设这个大小为单位 1，于是可以得到空间复杂度计算方程：

$\displaystyle Size = n + np + np^2 + \dots + n p^{log_{\frac{1}{p}} n}$

乘 p 做差可得：

$\displaystyle Size = \frac{1}{1 - p} *(n - n * n^{-1} * p) = \frac{n - p}{1- p}$

可见 p 于区间 0-1 内，与 Size 是成正比关系的，所以在考虑 $\frac{1}{2}$ 与 $\frac{1}{4}$ 这两个理论实践复杂度时间取值中，Redis 选择了 $\frac{1}{4}$。

整数集合（Integer Set）

整数集合定义在 intset.c 与 inset.h 这两个文件中，它是用于保存整数值的集合抽象数据结构：即它保存内容为整数值，并且集合内不会出现重复元素。它是集合键的底层实现之一。

结构

intset 的结构定义相当简单：

1
2
3
4
5
typedef struct intset {
    uint32_t encoding;
    uint32_t length;
    int8_t contents[];
} intset;

其中 encoding 可以认为是高级语言中的枚举值，它定义在 intset.c 文件中：

1
2
3
#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))

contents 实际存储的数据类型由这个 encoding 决定，它会在访问时进行强制类型转化。

升级

通过不同的 encoding 存储有以下好处：

提升灵活性，因为 C 语言的特性，不能用一个结构同时存储多个类型的数据；
节约内存，不需要为了大量小数字开辟大量高位内存；

但是因为 inset 的内容会动态变化，在一些场景下会触发升级（encoding 从小的类型转变为大的类型）。

升级：intsetUpgradeAndAdd，在插入一个比类型值还大的数字时会触发升级。函数签名：

1
2
/* Upgrades the intset to a larger encoding and inserts the given integer. */
static intset *intsetUpgradeAndAdd(intset *is, int64_t value)

PS：为什么 inset 不像 hashtable 设计一个渐进性的升级方案。

压缩列表（Zip List）

压缩列表是为了节约内存而诞生的一种数据结构，它的本质是一个通过特殊编码方式存储的双向链表。

当一个列表值中只包含少量的列表项，并且每个列表项要么就是小的整数值，要么就是比较短的字符串。那么Redis 底层就是使用压缩列表来做列表键的底层实现。

结构

压缩列表的详细数据结构在源代码的 ziplist.c 文件开头有长达 200 行的注释解释，大体概括就是下面的结构：

1
<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>

其中：

zlbytes 指明了包括它自己在内的整个 ziplist 字节大小；
zltail 指明了最后一个 entry 的相对偏移；
zllen 指明了数组长度大小，即之后跟的 entry 的数量；
zlend 是一个表征压缩列表结尾的字节，固定的值 0xFF；

entry 的内部也有许多通过编码降低内存的设计，大体概述就是下面的结构：

1
<prevlen> <encoding> <entry-data>

在不同的情况下，这一基本结构有不同的变体。

比如对于小整数，encoding，自身就可以表示数值，此时的结构为：

1
<prevlen> <encoding>

prevlen 通常只用一个字节表示，如果长度大于等于 254，则第一个字节置位为 0xFE：

1
0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry>

操作

压缩列表是一个“时间换空间”的设计，所以只能用于小列表项。

下标访问、entry 前指、entry 后指、获取字节数、获取列表大小等查询操作都是 O(1)；
插入、删除等涉及到大小变化的更新操作时间复杂度都是平均 O(N)；
插入、删除可能会引发连锁更新，所以最坏的时间复杂度是 O(N^2)

什么是连锁更新？

当 ziplist 保存了大量长度为 253 长度的 entry 时，如果在第一个位置插入了一个长度大于 253 的元素，则会导致之后的每一个 entry 执行连锁更新，时间复杂度 O(N^2)

对象

Redis 暴露给用户的并不是上面列举的这些“基本数据结构”，而是五个对象：

字符串对象（无前缀）、列表对象 (l)、哈希对象 (h)、集合对象 (s)、有序集合对象 (z)；

Redis 通过不同的编码方式表征具体的底层实现。

结构

redisObject 声明在文件 rio.h 中，实际定义在 server.c 中：

1
2
3
4
5
6
7
8
9
typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    int refcount;
    void *ptr;
} robj;

其中 type 就是在前面列举的五个基本数据类型：

1
2
3
4
5
6
/* The actual Redis Object */
#define OBJ_STRING 0    /* String object. */
#define OBJ_LIST 1      /* List object. */
#define OBJ_SET 2       /* Set object. */
#define OBJ_ZSET 3      /* Sorted set object. */
#define OBJ_HASH 4      /* Hash object. */

编码

encoding 表征这这个类型的对象底层使用的数据结构，通常一个类型只会使用两个底层数据结构：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
/* Objects encoding. Some kind of objects like Strings and Hashes can be
 * internally represented in multiple ways. The 'encoding' field of the object
 * is set to one of this fields for this object. */
#define OBJ_ENCODING_RAW 0     /* Raw representation */
#define OBJ_ENCODING_INT 1     /* Encoded as integer */
#define OBJ_ENCODING_HT 2      /* Encoded as hash table */
#define OBJ_ENCODING_ZIPMAP 3  /* Encoded as zipmap */
#define OBJ_ENCODING_LINKEDLIST 4 /* No longer used: old list encoding. */
#define OBJ_ENCODING_ZIPLIST 5 /* Encoded as ziplist */
#define OBJ_ENCODING_INTSET 6  /* Encoded as intset */
#define OBJ_ENCODING_SKIPLIST 7  /* Encoded as skiplist */
#define OBJ_ENCODING_EMBSTR 8  /* Embedded sds string encoding */
#define OBJ_ENCODING_QUICKLIST 9 /* Encoded as linked list of ziplists */
#define OBJ_ENCODING_STREAM 10 /* Encoded as a radix tree of listpacks */

具体到五个对象：

字符串对象：int、raw、embstr；
列表对象：ziplist、linkedlist；3.2 后 quicklist 成为其唯一编码模式；
哈希对象：ziplist、hashtable；
集合对象：intset、hashtable；
有序集合对象：ziplist、skiplist；

内存管理

针对于对象，Redis 使用“引用计数”的内存回收机制，并为 0-10000 这些数字设置了默认的对象共享。