In today’s cloud era gRPC is very popular, and the default serialization encoding of gRPC Protocol Buffers is also popular. It is said that Protocol Buffers is very efficient, but what is it? Today we will discuss the coding rules of Protocol Buffers.

There are basically two types of objects in the code, one with a fixed length, such as int32 occupying 32 bits and double occupying 64 bits, and the other with a variable length, such as strings. So, when designing the encoding, the first thing you have to do is to distinguish between these two cases. The easiest way is to use a byte to represent the type and transfer the data immediately after, as shown in the following figure.

1
2
3
4
5
    type           data
 +--------+--------+~~+--------+
 |xxxxxxxx|xxxxxxxx|  |xxxxxxxx|
 +--------+--------+~~+--------+
  7      0 7      0    7      0

A byte has 256 values. We can assign a number to each type. When decoding, we first read first byte and then read the data of the corresponding length according to the different types. For fixed-length type, the decoding is done here. For variable-length type data, we also need to confirm the length of the data. How to transfer this length? Let’s take string as an example. First pass a byte to indicate the length, and then pass the real string, as follows.

1
2
3
4
5
type=string  length   data
  +--------+--------+~~+--------+
  |xxxxxxxx|xxxxxxxx|  |xxxxxxxx|
  +--------+--------+~~+--------+
   7      0 7      0    7      0

The maximum value that a byte can represent is 255, so the length of a string cannot exceed 255 bytes. This is definitely not possible. What can we do? If we use two bytes, the maximum length we can represent will be extended to 65535 bytes. This is sufficient for general use. But what if we want to transfer a longer string? What about adding more bytes?

Because length is variable, using fixed-length bytes is inflexible: too short and the range is too small; too long and the transmission is too inefficient. If we use 4 bytes for length 1, we have 0x00 0x00 0x00 0x00 0x01, the high 31 bits are all zeros and no information is passed.

Wouldn’t it be more efficient if we could remove these zeros? For example, for length 1 pass only 0x01 and for length 4112 pass only 0x1010. But then, this leads to another problem: how to determine the number of bytes needed to represent the length. We seem to be back to square one. In order to represent the length of a string, we introduced the length field, which is now also uncertain.

Different protocols use different approaches to this problem.

For example, the websockt protocol defines a 7bit field for length, the maximum length that can be represented is 127, which is definitely not enough. So the websocket protocol also specifies that when the length is 126, two bytes are transmitted immediately afterwards to represent the real length, and when the length is 127, eight bytes are transmitted immediately afterwards to represent the real length. This is a fragment of the defined message format from RFC6455.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-------+-+-------------+-------------------------------+
 |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
 |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
 |N|V|V|V|       |S|             |   (if payload len==126/127)   |
 | |1|2|3|       |K|             |                               |
 +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
 |     Extended payload length continued, if payload len == 127  |
 + - - - - - - - - - - - - - - - +-------------------------------+
 ~                                                               ~

In websocket, a length of 0~126 takes up one byte; 128~65534 takes up two bytes, and greater than 65535 takes up eight bytes. websockt’s approach is really an improvement over simply setting a fixed length byte. However, it is a bit too coarse to divide the length into only three intervals. For example, words between 128 and 256 take up two bytes; once the length exceeds 65534, it takes up eight bytes.

Obviously, we need a more fine-grained segmentation scheme. This is where VarInts comes in.

The websoket protocol exploits the numbers 126 and 127 to represent the total number of bytes in the length field to achieve dynamic expansion. varInts exploits the maximum bit (MSB) of each byte. The specific representation is shown in the following table.

1
2
3
4
5
   0 ~ 2^07 - 1 0xxxxxxx
2^07 ~ 2^14 - 1 1xxxxxxx 0xxxxxxx
2^14 ~ 2^21 - 1 1xxxxxxx 1xxxxxxx 0xxxxxxx
2^21 ~ 2^28 - 1 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx
2^28 ~ 2^35 - 1 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx

In simple terms, when decoding, if the MSB of the byte read is 1, it means there is still a subsequence byte, until the byte with MSB 0 is read. For example, length 624485 can be encoded as follows.

1
2
3
4
5
6
7
MSB ------------------ LSB
      10011000011101100101  <1> 二进制 20bit
 0100110  0001110  1100101  <2> 从 LSB 到 MSB 每七位分一组,不足七位高位补零
00100110 10001110 11100101  <3> 最右边一组高位补零,其他组高位补一
    0x26     0x8E     0xE5  <4> 转换成十六进制

→ 0xE5 0x8E 0x26            <5> 从 LSB 到 MSB 输出结果

VarInts requires up to 10 bytes if it is to support a 64-bit integer range. VarInts is a bit worse than websocket in terms of maximum byte footprint. However, for small numbers, VarInts is more compact than websocket. You can think of websocket’s method as three speeds and VarInts as incremental speed.

With VarInts, we have solved the representation problem for fixed-length data and variable-length data. Is this the end of the coding design? No. We still have to solve the field mapping problem.

For the field mapping problem, json and xml are both encoded with the field names directly, e.g.

1
2
3
4
{
  "foo":1,
  "bar": "hello"
}

The biggest advantage of this is that it is easy to read and the encoding and decoding logic are not dependent on each other. However, the disadvantage is also obvious: the transmission is inefficient. Protocol Buffers uses a different strategy, numbering the fields. As an example.

1
2
3
4
message Foo {
  int32  foo = 1;
  string bar = 2;
}

Please pay attention to the number after the equal sign in each field. This number, also called tag, cannot be repeated and corresponds one-to-one with the field . Each field in Protocol Buffers is logically divided into three parts after encoding.

1
<tag> <type> [<length>] <data>

tag, type, and length are all represented by VarInts.

Protocol Buffers defines 4 types of type in version 3.

  • 0 VarInt for int32, int64, uint32, uint64, sint32, sint64, bool, enum
  • 1 64-bit denotes fixed64, sfixed64, double
  • 2 Length-delimited denotes string, bytes, embedded messages, repeated fields
  • 5 32-bit denotes fixed32, sfixed32, float

The types represented by 3 and 4 are deprecated and will not be discussed. Because of the small number of types, Protocol Buffers uses only 3 bits for encoding, and actually transmits them as (tag<<3)|type.

As an example. If the foo field of the preceding Foo takes the value 1, the corresponding code is: 0x08 0x01. The type of foo is int32, and the corresponding type takes 0. And its tag is 1, so the first byte is (1<<3)|0 = 0x08, and the second byte is the VarInts encoding for the number 1, which is 0x01.

1
2
3
4
5
  7       0 7      0
 +-----+---+--------+
 |00001|000|00000001|
 +-----+---+--------+
   tag type   data

Let’s take another example of a string. If the bar field of Foo takes the value Lu, the corresponding code is: 0x12 0x03 0xe5 0x90 0x95. The type of bar is string, and the corresponding type is 2. And its tag is 2, so the first byte is (2<<3)|2 = 0x12, the second byte means the length of the string is 3, and the next 3 bytes are the Chinese character UTF-8 encoding.

1
2
3
4
5
  7       0 7      0 25        0
 +-----+---+--------+===========+
 |00010|010|00000011|0xe50x90x95|
 +-----+---+--------+===========+
   tag type  length    utf-8

The advantage of using tag is that you don’t have to transfer field names repeatedly, but it is also its disadvantage. Because there are no field names, the code that encodes and decodes must hold a mapping of field names to tags, which is done automatically when the code is generated. That is, you cannot decode Protocol Buffers data without a proto file.

Protocol Buffers also supports custom message fields and repeated fields. These two types of fields are encoded as string equivalents. Let’s start with an example of a repeated field.

1
2
3
4
5
6
7
message Baz {
  int32 b = 1;
}
message Bar {
  repeated int32 a = 1;
             Baz b = 2;
}

If we let the a field of Bar take [1,2,3] and let the b field take {b:4}, the corresponding code is 0x0a 0x03 0x01 0x02 0x03 0x12 0x02 0x08 0x04. This data can be split into two parts.

  1. 0x0a 0x03 0x01 0x02 0x03
  2. 0x12 0x02 0x08 0x04

Let’s start with the first part. Because the type of a is repeated int32, so the corresponding type takes 2; and the tag of a is 1, so the first byte should be (1<<3|2) = 0x0a. The second byte represents the length of the array, so it is 0x03, and the next three bytes are the VarInts encoding of 1 , 2 , and 3 respectively.

And the second part. Because the type of b is Bar, the corresponding type is also 2; and the tag of b is 2, so the first byte should be (2<<3|2) = 0x12. The second byte represents the length of message, so it is 0x02, and the next two bytes represent the encoding of Baz.

1
2
3
4
5
6
7
  7       0 7      0 25        0 7       0 7      0 7       0 7      0
 +-----+---+--------+===========+-----+---+--------+=====+===+========+
 |00001|010|00000011|0x010x02x03|00010|010|00000010|00001|000|00000100|
 +-----+---+--------+===========+-----+---+--------+=====+===+========+
   tag type  length    utf-8      tag type length    tag type   data
                                                   |<----- baz.b ---->|
 |<---------- foo.a ----------->|<-------------- foo.b -------------->|

We can’t distinguish between string, repeated and message from the data alone, and we have to rely on the proto definition to parse this kind of data.

VarInts are not well suited to represent negative numbers. Because negative numbers are represented by computers using complement, conversion to unit64 is a very large number. When you use VarInts, -1 actually takes up 10 bytes! That’s not tolerable! For this reason, Protocol Buffers has introduced two types of sint32 and sint64 to encode numbers into ZigZag, which is a simple way to represent negative numbers with positive numbers.

1
2
3
 +-- 0 -1  1 -2  2 -3  3...
 |   +--+--+--+--+--+--+----> 
 +-> 0  1  2  3  4  5  6...

Again, looking at the data alone, we can’t tell if the integers are encoded with ZigZag or not, and this information can only be obtained from the proto file.

There is one more issue to note with Protocol Buffers, and that is the range of tag values. Because VarInts are used, the highest bit of a single byte is zero, and the lowest three bits indicate the type, so only four bits are left available. This means that when you have more than 16 fields, you need to use more than two bytes to represent them.

That’s pretty much all there is to say about the encoding of Protocol Buffers. To summarize, there are a few things to keep in mind when using Protocol Buffers.

  1. don’t modify the field tag.
  2. try not to have more than 16 fields.
  3. try to use small integers.
  4. If you need to transfer negative numbers, use sint32 or sint64.