VXLAN (Virtual eXtensible Local Area Network) is an overlay technology that builds a virtual Layer 2 network on top of a Layer 3 network. RFC 7348 (Ref. 1) describes it as follows.

A framework for overlaying virtualized Layer 2 networks over Layer 3 networks.

Every technology exists to solve some problem, and VXLAN is no exception, so let’s take a look at the problems VXLAN is trying to solve.

  • The rise of virtualization (virtual machines and containers) means that a data center may have thousands of machines that need to communicate. Traditional VLAN technology uses a 12-bit identifier and supports at most 4096 networks, which can no longer keep up with ever-growing data centers
  • More and more data centers (especially public cloud services) need to provide multi-tenancy, where each tenant assigns its own IP and MAC addresses independently. Keeping this scalable and correct is another problem to be solved
  • Cloud computing requires high flexibility: virtual machines may be migrated on a large scale while their network stays available, which is the idea behind a "large Layer 2" network. Achieving this without letting the Layer 2 broadcast domain grow excessively is also a requirement for cloud computing networks

Traditional Layer 2 + Layer 3 networks struggle with these requirements. Improved technologies such as stacking, SVF, and TRILL can extend the scope of Layer 2 in the classic network, but it is very difficult to keep changes to the network small while still ensuring high flexibility.

Many solutions have been proposed to solve these problems, and VXLAN is one of them. VXLAN was introduced by VMware, Cisco, and a number of other large companies, and it is currently documented in RFC 7348.

VXLAN model

One feature of VXLAN is its low impact on the existing network architecture: the original network does not need to be changed, and a new layer is built on top of it.

Naturally, VXLAN introduces some new concepts, which this section covers. The following diagram shows the working model of VXLAN. It is created on top of the original IP network (Layer 3) and can be deployed on any network that is Layer 3 reachable (that is, whose nodes can communicate with each other over IP). A VTEP at each endpoint is responsible for encapsulating and decapsulating VXLAN protocol messages, that is, for wrapping the virtual machine's message in the headers used for communication between VTEPs. Multiple VXLAN networks can be created on one physical network, and these VXLAN networks can be thought of as tunnels through which virtual machines on different nodes appear to be directly connected. Each VXLAN network is identified by a unique VNI, and different VXLAN networks do not affect each other.

VXLAN model

  • VTEP (VXLAN Tunnel Endpoint): the edge device of a VXLAN network, which processes VXLAN messages (encapsulation and decapsulation). A VTEP can be a network device (such as a switch) or a machine (such as a host in a virtualized cluster)
  • VNI (VXLAN Network Identifier): the identity of each VXLAN network, a 24-bit integer that allows 2^24 = 16,777,216 values (more than 16 million). Generally each VNI corresponds to a tenant, which means that a public cloud built with VXLAN can theoretically support more than ten million tenants
  • Tunnel: a logical concept with no specific physical entity in the VXLAN model. A tunnel can be seen as a virtual channel in which both sides of the VXLAN communication (the virtual machines in the diagram) believe they are communicating directly and are unaware of the underlying network. As a whole, each VXLAN network looks like a separate communication channel, or tunnel, to the communicating VMs
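To make the difference in scale concrete, here is a small arithmetic sketch comparing the 12-bit VLAN ID space with the 24-bit VNI space. The constant names are only illustrative.

```python
VLAN_ID_BITS = 12  # width of the 802.1Q VLAN ID field
VNI_BITS = 24      # width of the VXLAN Network Identifier field

vlan_ids = 2 ** VLAN_ID_BITS
vnis = 2 ** VNI_BITS

print(vlan_ids)          # 4096
print(vnis)              # 16777216
print(vnis // vlan_ids)  # 4096 times as many identifiers
```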

For now, these concepts may still seem abstract and hard to grasp. Below we explain the message format and communication flow of a VXLAN network. Come back to these concepts at the end of the article and they should make more sense.

VXLAN message parsing

As mentioned earlier, VXLAN builds a virtual Layer 2 network on top of a Layer 3 network, and this is clearly reflected in the VXLAN message format.

The figure below shows a VXLAN protocol message. The white part is the message sent by the virtual machine (a Layer 2 frame with MAC header, IP header, and transport layer header). It is preceded by the VXLAN header, which carries the VXLAN-related information, and that in turn is preceded by the standard outer headers (UDP header, IP header, and MAC header) used to transmit the message over the underlying network.

Three parts can be seen in this message.

  1. The outermost part is a UDP protocol message used for transmission over the underlying network; it is the basis on which VTEPs communicate with each other
  2. The middle part is the VXLAN header. After a VTEP receives the message, it removes the outer UDP protocol part and applies the VXLAN logic based on this header, mainly delivering the message to the final virtual machine according to the VNI
  3. The innermost part is the original message, that is, the message as seen by the virtual machine

The meaning of each part of the message is as follows.

  • VXLAN header: the part specific to the VXLAN protocol, 8 bytes in total
    • VXLAN flags: 8 flag bits; RFC 7348 requires the I bit to be set to 1 for a valid VNI
    • Reserved: 24 reserved bits
    • VNID: 24-bit VNI field, which is what lets VXLAN support millions of tenants
    • Reserved: 8 reserved bits
  • UDP header: 8 bytes
    • The communicating UDP endpoints are the VTEPs. The destination port is the port used by the receiving VTEP; the IANA-assigned port is 4789
  • IP header: 20 bytes
    • Addresses used for communication between hosts. The destination may be the IP address of a host’s NIC or a multicast IP address
  • MAC header: 14 bytes
    • MAC addresses used for communication between hosts. The source MAC address is the host MAC address, and the destination MAC address is the MAC address of the next-hop device
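The 8-byte VXLAN header described above can be sketched with Python's `struct` module. This is a minimal illustration, not a full implementation: the function names are hypothetical, and the flag value 0x08 is the I bit from RFC 7348.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag defined in RFC 7348


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags(8) | reserved(24) | VNI(24) | reserved(8)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Two big-endian 32-bit words: flags in the top byte of the first word,
    # VNI in the top three bytes of the second word.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI, rejecting headers whose I flag is not set."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, VNI is not valid")
    return word2 >> 8


header = pack_vxlan_header(5000)
print(len(header))         # 8
print(header.hex())        # 0800000000138800
print(unpack_vni(header))  # 5000
```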

It can be seen that VXLAN adds 50 bytes (14 + 20 + 8 + 8) to the original message, which reduces the share of useful data carried on the network link. The most important field in the VXLAN header is the VNID. The reserved fields are mainly for future extension, and some vendors use them to add their own functionality.

VXLAN network communication process

From the previous section, we have a general understanding of how VXLAN messages are sent. The VTEP adds the VXLAN header and the outer headers to the virtual machine's message and sends it out. The other VTEP receives it, removes the outer headers and the VXLAN header, and then delivers the original message to the destination virtual machine according to the VNI.

The process above assumes that both parties already know everything needed to communicate, but several issues must be resolved before the first communication.
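The send and receive sides of this flow can be sketched as follows. This is a simplified model under stated assumptions: the outer UDP, IP, and MAC headers are left to the operating system and omitted, and `segments` is a hypothetical stand-in for the per-VNI virtual networks attached to the receiving VTEP.

```python
import struct
from typing import Dict, List, Tuple


def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Sending VTEP: prepend the 8-byte VXLAN header (I flag set)."""
    return struct.pack("!II", 0x08 << 24, vni << 8) + inner_frame


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Receiving VTEP: split the UDP payload into (VNI, inner frame)."""
    _, word2 = struct.unpack("!II", payload[:8])
    return word2 >> 8, payload[8:]


def deliver(payload: bytes, segments: Dict[int, List[bytes]]) -> None:
    """Hand the original frame to the virtual network selected by the VNI."""
    vni, inner_frame = decapsulate(payload)
    segments.setdefault(vni, []).append(inner_frame)


segments: Dict[int, List[bytes]] = {}
deliver(encapsulate(100, b"frame-from-vm-a"), segments)
deliver(encapsulate(200, b"frame-from-vm-c"), segments)
print(segments)  # {100: [b'frame-from-vm-a'], 200: [b'frame-from-vm-c']}
```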

  • Which VTEPs need to be added to the same VNI group?
  • How does the sending VM know the MAC address of the other side?
  • How does the VTEP know which host the destination VM is on?

These three questions boil down to one: how do the members of a VXLAN network become aware of each other and choose the correct path for transmitting messages?

The first question is easy to answer, because a VTEP group is only a logical concept: the VTEPs that can correctly deliver messages to one another are, by definition, in the same group. In other words, we only need to answer the last two questions.

To answer these two questions, let’s go back to the VXLAN protocol message and see what information is needed to build a complete one.

  • Inner message: the communicating VMs either use IP addresses directly or have already obtained the other side's IP address through DNS or similar means, so the network layer address is already known. VMs on the same network also need the MAC address of the other VM, so VXLAN needs a mechanism that provides what ARP provides in a traditional network
  • VXLAN header: only the VNI is needed. It is generally configured directly on the VTEP, either planned and written in advance or generated automatically from the inner message, so there is nothing to worry about here
  • UDP header: the important fields are the source and destination ports. The source port is generated and managed by the system, and the destination port is fixed, for example the IANA-assigned port 4789, so there is nothing to worry about here either
  • IP header: what matters is the IP addresses of the two VTEPs. The source address is easy to determine. The destination address is the IP address of the VTEP on the host where the destination virtual machine is located, and this has to be determined in some way
  • MAC header: once the IP address of the destination VTEP is known, the MAC address can be obtained through classic ARP. After all, the VTEPs share one Layer 3 network, so the classic network mechanisms can be used directly

To summarize, two pieces of address information must be determined for a VXLAN message: the MAC address of the destination virtual machine and the IP address of the destination VTEP. If the VNI is also learned dynamically, then the VTEP needs the following triplet:

Internal MAC <-> VNI <-> VTEP IP
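One way to picture this triplet is as a lookup table keyed by VNI and inner MAC address. The sketch below is a hypothetical data structure for illustration, not any particular implementation; the addresses are documentation examples.

```python
from typing import Dict, Optional, Tuple


class VtepForwardingTable:
    """Maps (VNI, inner MAC) to the IP address of the remote VTEP."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, str], str] = {}

    def learn(self, vni: int, inner_mac: str, vtep_ip: str) -> None:
        # Normalise the MAC so lookups are case-insensitive.
        self._entries[(vni, inner_mac.lower())] = vtep_ip

    def lookup(self, vni: int, inner_mac: str) -> Optional[str]:
        return self._entries.get((vni, inner_mac.lower()))


table = VtepForwardingTable()
table.learn(100, "02:00:00:00:00:0B", "192.0.2.2")

print(table.lookup(100, "02:00:00:00:00:0b"))  # 192.0.2.2
print(table.lookup(200, "02:00:00:00:00:0b"))  # None: same MAC, different VNI
```

Keying on the VNI as well as the MAC is what allows different tenants to reuse the same MAC addresses without conflict.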

Depending on the implementation, there are two general approaches: multicast and a control center. With multicast, the VTEPs of the same VXLAN network join the same multicast group, and a VTEP that needs the information above sends a multicast query within the group. With a control center, the information above for all virtual machines is kept in a centralized place, and each VTEP is automatically told the information it needs.

We analyze each of these two approaches below.


Multicast

The concept and workings of multicast are not the focus here, so they will not be introduced in detail. Simply put, each multicast group corresponds to a multicast IP address, and messages sent to this multicast IP address are delivered to all hosts in the multicast group.

Why use multicast? Because the network underlying VXLAN is a Layer 3 network, and broadcasts cannot traverse a Layer 3 network, so in this mode the way to send a message to all VTEPs in the VXLAN network is multicast.

The following figure shows the workflow of VXLAN messages in multicast mode. Machine A on the bottom left wants to send messages to Machine B on the bottom right through the VXLAN network.

VXLAN messages in multicast mode

When a VTEP is created, it is added to the multicast group through configuration (the details depend on the implementation). The multicast group IP address in the figure is

  1. Machine A only knows the IP address of the other party, not its MAC address, so it sends an ARP request. This inner ARP message is completely ordinary; its destination MAC address is the all-ones broadcast address (ff:ff:ff:ff:ff:ff)
  2. vtep-1 receives the ARP message, finds that the destination MAC is the broadcast address, encapsulates it with the VXLAN protocol headers (the outer IP is the multicast group IP and the outer MAC is the multicast group MAC address), and sends it to the multicast group. The underlying network devices (switches and routers) that support multicast deliver the message to all members of the group
  3. Each receiving VTEP gets the VXLAN-encapsulated ARP request, removes the VXLAN header, learns and saves the sender's <VM MAC - VNI - VTEP IP> triplet from the message, and broadcasts the original ARP message to its local hosts
  4. Each host receives the ARP request and returns an ARP reply if the request asks for its own MAC address
  5. vtep-2 now knows the virtual machine and VTEP information, so it encapsulates the ARP reply with the VXLAN headers (the outer destination IP is the IP address of vtep-1, and the VNI is the VNI of the original message) and sends it out via unicast
  6. vtep-1 receives the message, learns the triplet from it, and records it. It then decapsulates the message, reads the inner destination address, and forwards the inner message to the destination virtual machine
  7. The virtual machine receives the ARP reply and learns the MAC address of the destination virtual machine
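The steps above can be sketched as a small in-memory simulation. This is a simplified model under stated assumptions: no real packets are sent, the `Vtep` class and `flood_arp` function are hypothetical, and the addresses are documentation examples.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Vtep:
    ip: str
    # VMs attached to this VTEP: (VNI, VM IP) -> VM MAC
    local_vms: Dict[Tuple[int, str], str] = field(default_factory=dict)
    # Learned triplets: (VNI, VM MAC) -> remote VTEP IP
    learned: Dict[Tuple[int, str], str] = field(default_factory=dict)

    def handle_flooded_arp(self, vni: int, sender_mac: str,
                           sender_vtep_ip: str, target_ip: str) -> Optional[str]:
        # Step 3: learn the sender's triplet from the encapsulated request.
        self.learned[(vni, sender_mac)] = sender_vtep_ip
        # Step 4: a local VM answers only if the request is for its own IP.
        return self.local_vms.get((vni, target_ip))


def flood_arp(src: Vtep, group: List[Vtep], vni: int,
              sender_mac: str, target_ip: str) -> Optional[str]:
    """Step 2: the source VTEP sends the ARP request to every group member."""
    answer = None
    for member in group:
        if member is src:
            continue
        target_mac = member.handle_flooded_arp(vni, sender_mac, src.ip, target_ip)
        if target_mac is not None:
            # Steps 5-6: the unicast reply lets the source learn the reverse triplet.
            src.learned[(vni, target_mac)] = member.ip
            answer = target_mac
    return answer


MAC_A, MAC_B = "02:00:00:00:00:0a", "02:00:00:00:00:0b"
vtep1 = Vtep("192.0.2.1", local_vms={(100, "10.0.0.1"): MAC_A})
vtep2 = Vtep("192.0.2.2", local_vms={(100, "10.0.0.2"): MAC_B})
vtep3 = Vtep("192.0.2.3")  # in the group, but hosts no matching VM

print(flood_arp(vtep1, [vtep1, vtep2, vtep3], 100, MAC_A, "10.0.0.2"))
# 02:00:00:00:00:0b
print(vtep1.learned)  # {(100, '02:00:00:00:00:0b'): '192.0.2.2'}
print(vtep3.learned)  # {(100, '02:00:00:00:00:0a'): '192.0.2.1'}
```

Note that `vtep3` also receives and processes the request even though it hosts no matching VM, which is the kind of wasted delivery that grows with the size of the group.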

Only one multicast occurs in this process: because VTEPs learn automatically, subsequent messages are sent directly via unicast. Multicast is still wasteful, since only one copy of each multicast message is actually useful, and the waste grows with the number of VTEPs in the group. On the other hand, multicast has its advantages: the implementation is relatively simple, no centralized control is required, and as long as the underlying network supports multicast, configuring the multicast group is enough for automatic discovery.

Sending a unicast message follows the same logic as the reply message above, so it should be easy to understand. Communication between different VNI networks is another case: it requires a VXLAN gateway (either a physical network device or software) that receives a message from one VXLAN network, decapsulates it, and then adds another VXLAN header to forward it according to its own logic.

Because not all network devices support multicast, and because of the message waste associated with multicast, this approach is rarely used in production.

Distributed Control Center

From the multicast flow, it can be seen that what a VTEP most needs in order to send a message is the MAC address of the other VM and the IP address of the VTEP on the host where that VM is located. If these two pieces of information are known in advance and given to the VTEP directly, there is no need for multicast.

In VM and container scenarios, the IP and MAC addresses of a VM or container are known when it starts, before it has communicated over the network (either by obtaining them in some way or by assigning the two addresses in advance), and the distributed control center stores this information. In addition, the control center records which VTEPs belong to each VXLAN network and what their addresses are. With this information, a VTEP can look up the destination and add the headers directly when sending a message, without multicasting a query across the network.

In general, there is an agent on each VTEP’s node, which communicates with the control center to get the information the VTEP needs and passes it to the VTEP in some way. The details depend on the implementation, and each implementation may update different information on the VTEP. For example, HER (Head End Replication) simply replaces the multicast group with multiple unicast messages: it tells the VTEP the IP addresses of all VTEPs in the group, so that instead of sending one multicast when querying, the VTEP sends a unicast message to each VTEP in the group. Some implementations only tell the VTEP the MAC address information of the destination virtual machine, and some tell the VTEP the IP address corresponding to the MAC address.
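The idea behind HER can be sketched in a few lines. This is an illustrative sketch only: `flood_list` is a hypothetical per-VNI list of VTEP addresses that the agent is assumed to have received from the control center.

```python
from typing import Dict, List


def head_end_replicate(vni: int, local_vtep_ip: str,
                       flood_list: Dict[int, List[str]]) -> List[str]:
    """Return the unicast destinations that replace one multicast send."""
    return [ip for ip in flood_list.get(vni, []) if ip != local_vtep_ip]


# Assumed to be pushed by the control center: VNI -> all VTEPs in that network.
flood_list = {
    100: ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
    200: ["192.0.2.1", "192.0.2.4"],
}

print(head_end_replicate(100, "192.0.2.1", flood_list))
# ['192.0.2.2', '192.0.2.3']
print(head_end_replicate(300, "192.0.2.1", flood_list))
# []
```

The sending VTEP would then encapsulate the same inner frame once per returned address, so the underlying network no longer needs multicast support, at the cost of more copies leaving the sender.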

In addition, implementations differ in when they tell the VTEP this information. There are generally two ways. The common way is to tell the VTEP a virtual machine’s triplet as soon as it is known, usually before the first communication has occurred (even if a given VTEP never uses it because the virtual machines it manages never communicate with that address). The other way is on demand: when the VTEP needs the information during the first communication, the agent is notified in some way and then tells the VTEP the information at that time.

Distributed control of VXLAN is a typical SDN architecture, and it is the most widely used approach today. Because there are many implementations that differ from one another, it is not convenient to walk through a specific example here. Once the principles above are understood, it should be possible to get started quickly with any implementation.

VXLAN networking brings new issues

The VXLAN protocol brings flexibility and scalability to virtual networks, allowing cloud computing networks to scale on demand and be distributed as flexibly as compute and storage resources. As with all technologies in computing, this is a tradeoff, and the main problems with VXLAN are its complexity and its additional overhead compared to classic networks.
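The two timing strategies can be contrasted in a short sketch. The `ControlCenter` and `Agent` classes are hypothetical and stand in for whatever a real implementation uses; the addresses are documentation examples.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str]  # (VNI, VM MAC)


class ControlCenter:
    def __init__(self) -> None:
        self._triplets: Dict[Key, str] = {}
        self._subscribers: List["Agent"] = []

    def subscribe(self, agent: "Agent") -> None:
        self._subscribers.append(agent)

    def register_vm(self, vni: int, mac: str, vtep_ip: str) -> None:
        self._triplets[(vni, mac)] = vtep_ip
        # Eager mode: push the triplet to every subscribed agent immediately.
        for agent in self._subscribers:
            agent.table[(vni, mac)] = vtep_ip

    def query(self, vni: int, mac: str) -> Optional[str]:
        return self._triplets.get((vni, mac))


class Agent:
    def __init__(self, center: ControlCenter, eager: bool) -> None:
        self.center = center
        self.table: Dict[Key, str] = {}
        if eager:
            center.subscribe(self)

    def lookup(self, vni: int, mac: str) -> Optional[str]:
        key = (vni, mac)
        if key not in self.table:
            # On-demand mode: ask the control center at first use, then cache.
            vtep_ip = self.center.query(vni, mac)
            if vtep_ip is not None:
                self.table[key] = vtep_ip
        return self.table.get(key)


center = ControlCenter()
eager_agent = Agent(center, eager=True)
lazy_agent = Agent(center, eager=False)
center.register_vm(100, "02:00:00:00:00:0b", "192.0.2.2")

print(len(eager_agent.table), len(lazy_agent.table))  # 1 0
print(lazy_agent.lookup(100, "02:00:00:00:00:0b"))    # 192.0.2.2
print(len(lazy_agent.table))                          # 1
```

The eager agent holds entries it may never use, while the on-demand agent pays a lookup delay on the first message to each destination.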

Additional messages and calculations

It is easy to see that each VXLAN message carries an additional 50 bytes of overhead, and when a VLAN tag is added, the overhead goes up to 54 bytes. This is very expensive for small messages. Imagine that the application data in a message is only a few bytes, while the original network headers plus the VXLAN headers amount to about 100 bytes of control information.

The extra headers also bring extra computation, since every VXLAN message must be encapsulated and decapsulated, and this has a non-negligible impact when these steps are implemented in software.
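The overhead figures can be checked with a short calculation. The header sizes come from the message format described earlier; the inner header size is an assumption (Ethernet + IPv4 + TCP without options), and the function names are illustrative.

```python
OUTER_MAC = 14
OUTER_IP = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
VLAN_TAG = 4  # optional 802.1Q tag on the outer frame

# Assumption: the VM's own frame carries Ethernet (14) + IPv4 (20) + TCP (20).
INNER_HEADERS = 14 + 20 + 20


def vxlan_overhead(outer_vlan_tag: bool = False) -> int:
    """Bytes added in front of the original frame."""
    base = OUTER_MAC + OUTER_IP + OUTER_UDP + VXLAN_HEADER
    return base + (VLAN_TAG if outer_vlan_tag else 0)


def payload_share(app_bytes: int, outer_vlan_tag: bool = False) -> float:
    """Fraction of the bytes on the wire that are application data."""
    total = app_bytes + INNER_HEADERS + vxlan_overhead(outer_vlan_tag)
    return app_bytes / total


print(vxlan_overhead())      # 50
print(vxlan_overhead(True))  # 54
for size in (10, 100, 1400):
    print(size, f"{payload_share(size):.1%}")
```

Under these assumptions, a 10-byte payload makes up well under 10% of the bytes sent, while a 1400-byte payload makes up more than 90%.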


Complexity

Another disadvantage of VXLAN is its complexity. Classic networks are stretched thin when dealing with cloud computing, but the classic network model has been developed over a long time, and its deployment, monitoring, and operations and maintenance are all relatively mature. With a VXLAN network, all of this has to be relearned, and the time and labor costs are bound to be much higher.


Reference https://houmin.cc/posts/75019d0/