Introduction

In business systems we often need a unique ID to identify a piece of data. The most common approaches are the following: UUIDs, database auto-increment primary keys, Redis's INCR command, and similar ways to obtain a unique value. Below we discuss the advantages and disadvantages of each in order to introduce the distributed snowflake algorithm.

UUID

First is the UUID, which is a 128-bit binary value, typically written in hexadecimal and represented as a string. To ensure uniqueness, the UUID specification defines elements such as the NIC MAC address, a timestamp, a namespace, random or pseudo-random numbers, and a clock sequence, together with algorithms for generating a UUID from these elements.

There are five versions of UUIDs:

  • Version 1: based on a timestamp and MAC address
  • Version 2: based on a timestamp, MAC address, and POSIX UID/GID
  • Version 3: based on an MD5 hash of a namespace and name
  • Version 4: based on random numbers
  • Version 5: based on a SHA-1 hash of a namespace and name

Pros and Cons of UUID:

The advantages are simple code and good performance. The disadvantages are that UUIDs are unordered and cannot be guaranteed to increase monotonically; and they are long, so they take up more space in the database and are unfriendly to indexing and sorting.
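To make the UUID discussion concrete, here is a minimal sketch of a version-4 (random) UUID generator using only Go's standard library crypto/rand; in practice you would normally use a library such as github.com/google/uuid instead.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newV4 builds a version-4 (random) UUID string from 16 random bytes.
// Minimal sketch for illustration, not a substitute for a real UUID library.
func newV4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4 in the high nibble of byte 6
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits in byte 8
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newV4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id, "length:", len(id)) // a random 36-character UUID string
}
```

Note how this illustrates the trade-off from the text: the IDs are unique but completely unordered, which is exactly why they index and sort poorly.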

Database auto-increment primary key

If you use a MySQL database, setting the primary key to AUTO_INCREMENT is the easiest way to obtain a monotonically increasing unique ID, and it is also convenient for sorting and indexing.

But the disadvantages are also obvious: the heavy reliance on the database means throughput is capped by database performance, so concurrency is limited; if the data volume grows too large, splitting databases and tables becomes problematic; and if the database goes down, ID generation is unavailable altogether.

Redis

Redis is now an indispensable part of many projects. Redis provides the INCR and INCRBY commands; because Redis executes commands on a single thread, these increments are atomic, so they can be used to generate unique values with good performance.

However, even with AOF and RDB persistence, Redis can still lose data after a restart, which may result in duplicate IDs; furthermore, the dependency on Redis means that ID generation is affected whenever Redis is unstable.

Snowflake

The analysis above finally leads to the distributed snowflake algorithm, Snowflake, which was first used internally by Twitter as a unique ID generation algorithm in a distributed environment. It was open sourced in 2014. The open-source version is written in Scala and can still be found at:

https://github.com/twitter-archive/snowflake/tags

It has the following features:

  1. It can generate non-duplicate IDs in a high-concurrency distributed environment.
  2. Because it is based on timestamps, IDs are roughly ordered and increasing.
  3. It does not rely on third-party libraries or middleware.

Implementation Principle

A Snowflake ID is a 64-bit value of type int64, laid out as follows.


position    size     role
0~11 bit    12 bits  sequence number, distinguishing IDs generated within the same millisecond; holds 4096 values (0~4095)
12~21 bit   10 bits  machine ID; up to 1024 machines can be recorded
22~62 bit   41 bits  timestamp in milliseconds; covers about 69 years
63 bit      1 bit    sign bit, left as 0

The above is just the common standard way to divide the 64 bits; in practice it can be adjusted to your business situation. For example, suppose the current business has only about 10 machines, is expected to grow to triple digits, requires multi-room (multi-datacenter) deployment, and QPS will grow to a million within a few years.

A million QPS divided evenly among 10 machines is 100,000 requests per second per machine, so the 12-bit sequence number is fully sufficient (4096 IDs per millisecond is about 4 million per second). To accommodate a three-digit machine count and multi-room deployment, we only need to split the 10-bit worker ID: 3 bits for the room number, allowing up to 8 rooms, and the remaining 7 bits for the machine number, allowing up to 128 machines per room. The data format then looks as follows.
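As a sanity check on that arithmetic, here is a small sketch; the bit widths are the hypothetical 3-bit room / 7-bit machine split described above, not the standard layout.

```go
package main

import "fmt"

// maxVal returns the largest value representable in the given number of
// bits, using the same -1 ^ (-1 << bits) trick as the snowflake constants.
func maxVal(bits uint) int64 {
	x := int64(-1)
	return x ^ (x << bits)
}

func main() {
	fmt.Println("rooms:", maxVal(3)+1)             // 8 rooms (ids 0..7)
	fmt.Println("machines per room:", maxVal(7)+1) // 128 machines (ids 0..127)
	fmt.Println("ids per ms:", maxVal(12)+1)       // 4096 sequence values
	// 4096 ids/ms is about 4.1 million ids/second per machine, far above
	// the 100,000 requests/second each machine needs to handle.
	fmt.Println("ids per second:", (maxVal(12)+1)*1000)
}
```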

(figure: adjusted bit layout with a 3-bit room ID and 7-bit machine ID)

Code implementation

Implementation steps

In fact, once you understand the data structure above, implementing a snowflake algorithm yourself is very simple. The steps are roughly as follows.

  1. Obtain the current millisecond timestamp.
  2. Compare it with the timestamp saved from the previous call.
    1. If it equals the saved timestamp, add one to the sequence number; if the sequence overflows its 12 bits, wait for the next millisecond.
    2. If they are not equal, reset the sequence to 0.
  3. Use bitwise OR operations to stitch the fields together into the int64 value that the snowflake algorithm returns.

Code implementation

First we need to define a Snowflake structure.

type Snowflake struct {
    sync.Mutex          // mutex protecting the fields below
    timestamp    int64  // last timestamp used, in milliseconds
    workerid     int64  // worker (machine) node ID
    datacenterid int64  // datacenter (room) ID
    sequence     int64  // sequence number within the same millisecond
}

Then we need to define some constants to make the bitwise operations convenient when we use the snowflake algorithm.

const (
    epoch             = int64(1577808000000)                           // start time (ms since Unix epoch): 2020-01-01 00:00:00 UTC+8, valid for 69 years
    timestampBits     = uint(41)                                       // number of bits used by the timestamp
    datacenteridBits  = uint(2)                                        // number of bits used by the datacenter ID
    workeridBits      = uint(7)                                        // number of bits used by the machine ID
    sequenceBits      = uint(12)                                       // number of bits used by the sequence
    timestampMax      = int64(-1 ^ (-1 << timestampBits))              // maximum timestamp value
    datacenteridMax   = int64(-1 ^ (-1 << datacenteridBits))           // maximum supported datacenter ID
    workeridMax       = int64(-1 ^ (-1 << workeridBits))               // maximum supported machine ID
    sequenceMask      = int64(-1 ^ (-1 << sequenceBits))               // maximum sequence value, also used as a mask
    workeridShift     = sequenceBits                                   // left-shift amount for the machine ID
    datacenteridShift = sequenceBits + workeridBits                    // left-shift amount for the datacenter ID
    timestampShift    = sequenceBits + workeridBits + datacenteridBits // left-shift amount for the timestamp
)

It is important to note that -1 is represented in 64-bit two's-complement binary as

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111

So to get the maximum value of the 41-bit timestamp, you can shift -1 to the left by 41 bits, giving

11111111 11111111 11111110 00000000 00000000 00000000 00000000 00000000

Then we can XOR (^) this with -1:

00000000 00000000 00000001 11111111 11111111 11111111 11111111 11111111

This is the maximum 41-bit timestamp value. datacenteridMax and workeridMax are obtained the same way.
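This bit trick is easy to verify directly; the following short sketch prints the intermediate binary values and checks the claimed 69-year span.

```go
package main

import "fmt"

// maxForBits computes -1 ^ (-1 << n), the largest n-bit value.
func maxForBits(n uint) int64 {
	x := int64(-1)
	return x ^ (x << n)
}

func main() {
	x := int64(-1)
	fmt.Printf("-1            = %064b\n", uint64(x))
	fmt.Printf("-1 << 41      = %064b\n", uint64(x<<41))
	fmt.Printf("-1 ^ (-1<<41) = %064b\n", uint64(maxForBits(41)))

	// 2^41 - 1 milliseconds is roughly 69 years, matching the table above.
	years := float64(maxForBits(41)) / 1000 / 3600 / 24 / 365.25
	fmt.Printf("timestampMax = %d (~%.1f years)\n", maxForBits(41), years)
}
```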

Then we can set a NextVal function to get the ID returned by Snowflake.

func (s *Snowflake) NextVal() int64 {
    s.Lock()
    now := time.Now().UnixNano() / 1000000 // convert to milliseconds
    if s.timestamp == now {
        // Generating multiple IDs within the same millisecond increments the sequence number.
        s.sequence = (s.sequence + 1) & sequenceMask
        if s.sequence == 0 {
            // The sequence has overflowed its 12 bits, so busy-wait until the
            // next millisecond, which starts again from sequence 0.
            for now <= s.timestamp {
                now = time.Now().UnixNano() / 1000000
            }
        }
    } else {
        // A new millisecond starts directly from sequence 0.
        s.sequence = 0
    }
    t := now - epoch
    if t > timestampMax {
        s.Unlock()
        glog.Errorf("epoch must be between 0 and %d", timestampMax-1)
        return 0
    }
    s.timestamp = now
    r := int64(t<<timestampShift | (s.datacenterid << datacenteridShift) | (s.workerid << workeridShift) | (s.sequence))
    s.Unlock()
    return r
}
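The article never shows how a Snowflake instance is created. Below is a hedged, self-contained sketch of a constructor; the name NewSnowflake and its error messages are my own, not part of the original code. It validates the node IDs against the maxima so that an out-of-range value can never overflow into a neighbouring bit field during the OR stitching.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Same bit widths and max-value trick as the article's constants.
const (
	datacenteridBits = uint(2)
	workeridBits     = uint(7)
	datacenteridMax  = int64(-1 ^ (-1 << datacenteridBits))
	workeridMax      = int64(-1 ^ (-1 << workeridBits))
)

// Snowflake mirrors the struct defined earlier in the article.
type Snowflake struct {
	sync.Mutex
	timestamp    int64
	workerid     int64
	datacenterid int64
	sequence     int64
}

// NewSnowflake (hypothetical) rejects out-of-range IDs up front instead of
// letting them silently corrupt neighbouring bit fields in generated IDs.
func NewSnowflake(datacenterid, workerid int64) (*Snowflake, error) {
	if datacenterid < 0 || datacenterid > datacenteridMax {
		return nil, errors.New("datacenterid out of range")
	}
	if workerid < 0 || workerid > workeridMax {
		return nil, errors.New("workerid out of range")
	}
	return &Snowflake{datacenterid: datacenterid, workerid: workerid}, nil
}

func main() {
	s, err := NewSnowflake(1, 5)
	fmt.Println(s != nil, err) // true <nil>

	_, err = NewSnowflake(4, 5) // 4 exceeds datacenteridMax (3)
	fmt.Println(err)
}
```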

The NextVal code above is also very simple, and the comments should make it clear (note that it assumes imports of sync, time, and github.com/golang/glog). Let me explain what the series of bit operations in the returned value r means.

First of all, t is the time difference from the epoch. Since our epoch was set to 2020-01-01 00:00:00 at initialization, the 41-bit timestamp will not overflow for about 69 years. After t is shifted to the left by timestampShift, all the zero bits below that position are filled in by OR-ing in datacenterid, workerid, and sequence, each shifted to its own offset.

In the current example, if the current time is 2021-01-01 00:00:00, the result of the bitwise operations is as follows.

(figure: bit layout of the ID generated at 2021-01-01 00:00:00)