Kubernetes flannel Network Analysis

flannel is a Kubernetes CNI implementation open-sourced by CoreOS. It stores the network configuration of the entire cluster in etcd or the Kubernetes API. Each Kubernetes node runs the flanneld component, which obtains the cluster's network address space from etcd or the Kubernetes API and leases one subnet inside that space; all container IPs on that node are allocated from this subnet, which guarantees that IPs on different nodes never conflict. flannel implements cross-host container communication through different backends and currently ships backends such as udp, vxlan and host-gw. This article walks through the container communication process with the vxlan backend.

flannel changed its vxlan implementation in version v0.9.0. A very detailed comment in the source code describes the design and implementation of the different versions:

// Some design notes and history:
// VXLAN encapsulates L2 packets (though flannel is L3 only so don't expect to be able to send L2 packets across hosts)
// The first versions of vxlan for flannel registered the flannel daemon as a handler for both "L2" and "L3" misses
// - When a container sends a packet to a new IP address on the flannel network (but on a different host) this generates
// an L2 miss (i.e. an ARP lookup)
// - The flannel daemon knows which flannel host the packet is destined for so it can supply the VTEP MAC to use.
// This is stored in the ARP table (with a timeout) to avoid constantly looking it up.
// - The packet can then be encapsulated but the host needs to know where to send it. This creates another callout from
// the kernal vxlan code to the flannel daemon to get the public IP that should be used for that VTEP (this gets called
// an L3 miss). The L2/L3 miss hooks are registered when the vxlan device is created. At the same time a device route
// is created to the whole flannel network so that non-local traffic is sent over the vxlan device.
//
// In this scheme the scaling of table entries (per host) is:
// - 1 route (for the configured network out the vxlan device)
// - One arp entry for each remote container that this host has recently contacted
// - One FDB entry for each remote host
//
// The second version of flannel vxlan removed the need for the L3MISS callout. When a new remote host is found (either
// during startup or when it's created), flannel simply adds the required entries so that no further lookup/callout is required.
//
//
// The latest version of the vxlan backend removes the need for the L2MISS too, which means that the flannel deamon is not
// listening for any netlink messages anymore. This improves reliability (no problems with timeouts if
// flannel crashes or restarts) and simplifies upgrades.
//
// How it works:
// Create the vxlan device but don't register for any L2MISS or L3MISS messages
// Then, as each remote host is discovered (either on startup or when they are added), do the following
// 1) create routing table entry for the remote subnet. It goes via the vxlan device but also specifies a next hop (of the remote flannel host).
// 2) Create a static ARP entry for the remote flannel host IP address (and the VTEP MAC)
// 3) Create an FDB entry with the VTEP MAC and the public IP of the remote flannel daemon.
//
// In this scheme the scaling of table entries is linear to the number of remote hosts - 1 route, 1 arp entry and 1 FDB entry per host
//
// In this newest scheme, there is also the option of skipping the use of vxlan for hosts that are on the same subnet,
// this is called "directRouting"

Versions before v0.9.0 relied mainly on the L2MISS and L3MISS mechanism of the vxlan kernel module. An L2MISS is a netlink message the vxlan device sends to user space when it cannot find the MAC address for an inner destination IP in its ARP table. An L3MISS is a netlink message sent when the vxlan device cannot find in its FDB the IP address of the VTEP that owns the inner destination MAC of the VXLAN packet. The earlier article <<动态维护FDB表项实现VXLAN通信>> covered the relevant concepts and operations. This article focuses on the v0.9.0 implementation.
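
For reference only, a vxlan device with those miss callouts enabled would be created roughly like this with iproute2 (the device name, VNI and parent interface simply mirror the experiment environment below; this is a sketch, not the exact command any flannel version runs):

ip link add flannel.1 type vxlan id 1 dev eth1 dstport 8472 nolearning l2miss l3miss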

In the old scheme, flanneld acted as the handler for L2MISS and L3MISS messages: when such a message arrived, it fetched the corresponding ARP or FDB information from etcd or the Kubernetes API and filled in the missing entry. If flanneld exited abnormally, the container network of the whole cluster stopped working, which was a serious risk. The v0.9.0 implementation no longer handles L2MISS and L3MISS at all; instead, flanneld watches the node information in etcd or the Kubernetes API and proactively maintains the ARP, FDB and route entries each node needs to reach the others. Even if flanneld crashes, packet forwarding across the cluster keeps working. The design is elegant: each node needs only one route, one ARP entry and one FDB entry per remote host.
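
Concretely, the entries flanneld maintains for one remote host are roughly equivalent to the following iproute2 commands (flanneld programs them through netlink rather than by running commands; the addresses are those of node2 in the experiment environment described below):

ip route replace 10.230.93.0/24 via 10.230.93.0 dev flannel.1 onlink
ip neighbour replace 10.230.93.0 lladdr 2a:02:24:58:e9:07 dev flannel.1 nud permanent
bridge fdb append 2a:02:24:58:e9:07 dev flannel.1 dst 10.240.0.102 self permanent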

Below we analyze the flannel vxlan communication process in a test environment. The overall network architecture is shown in the figure:

The CNI configuration file /etc/cni/net.d/09-flannel.conf contains:

{
    "name": "cbr0",
    "cniVersion": "0.3.1",
    "type": "flannel",

    "delegate": {
        "isDefaultGateway": true
    }
}

Each pod on a node has a veth pair: one end sits in the pod's network namespace, the other end is attached to the cni0 bridge on the host. When flanneld starts, it creates the vxlan device flannel.1.
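
The host-side veth endpoints attached to the cni0 bridge can be listed with, for example:

ip link show master cni0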

The flannel network information on node1 is shown below; its leased subnet is 10.230.41.1/24:

[root@node1 ~]# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.230.0.0/16
FLANNEL_SUBNET=10.230.41.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false

The flannel network information on node2 is shown below; its leased subnet is 10.230.93.1/24:

[root@node2 ~]# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.230.0.0/16
FLANNEL_SUBNET=10.230.93.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false

Let's trace a packet sent from 10.230.41.17 to 10.230.93.2.

10.230.93.2 and 10.230.41.17 are not on the same layer-2 network, so a route lookup decides which device the packet leaves through and where it goes next. The routing table of 10.230.41.17 is:

[root@master1 ~]# kubectl exec -it busybox2-6f8fdb784d-r6ln2 -- ip route
default via 10.230.41.1 dev eth0
10.230.0.0/16 via 10.230.41.1 dev eth0
10.230.41.0/24 dev eth0 scope link src 10.230.41.17

The packet matches the default route, so it is sent to the gateway 10.230.41.1, which is configured on the bridge cni0. The kernel resolves the MAC address of 10.230.41.1 with an ARP request and forwards the packet to cni0.

[root@node1 ~]# ip addr show dev cni0
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 86:99:b6:37:95:b2 brd ff:ff:ff:ff:ff:ff
inet 10.230.41.1/24 brd 10.230.41.255 scope global cni0
valid_lft forever preferred_lft forever
inet6 fe80::8499:b6ff:fe37:95b2/64 scope link
valid_lft forever preferred_lft forever
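
Inside the pod, the resolved gateway MAC can be checked in its neighbour table (using the same busybox pod as above, provided its busybox build includes the ip neigh applet):

kubectl exec -it busybox2-6f8fdb784d-r6ln2 -- ip neigh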

When a node joins the cluster, flanneld installs an on-link route for each of the other nodes. An on-link route is treated as a direct route: for packets matching it, the kernel resolves the MAC address of the next-hop IP directly with an ARP (neighbour) lookup instead of recursing through another route. The route on node1:

[root@node1 ~]# ip route show dev flannel.1
10.230.93.0/24 via 10.230.93.0 onlink

According to this route, the packet that arrived on cni0 is handed to the vxlan device flannel.1, with 10.230.93.0 as the next-hop IP whose MAC address needs to be resolved.

The details of flannel.1 are shown below; note that neither l2miss nor l3miss is enabled:

[root@node1 ~]# ip -d link show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether a6:f7:8b:a4:60:b0 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 1 local 10.240.0.101 dev eth1 srcport 0 0 dstport 8472 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

The vxlan device has to encapsulate the packet with the VXLAN protocol, and for the inner Ethernet header it needs the MAC address of the peer 10.230.93.0. flanneld already wrote this entry into the ARP table at startup, based on the information obtained from etcd or the Kubernetes API:

[root@node1 ~]# ip neigh show dev flannel.1
10.230.93.0 lladdr 2a:02:24:58:e9:07 PERMANENT

With the MAC address of 10.230.93.0 known, the inner frame can be built. The device then needs the IP address of the VTEP that owns this MAC address in order to build the outer header. flanneld wrote this FDB entry at startup as well:

[root@node1 ~]# bridge fdb show dev flannel.1
2a:02:24:58:e9:07 dst 10.240.0.102 self permanent

As shown, the VTEP IP for 2a:02:24:58:e9:07 is 10.240.0.102. Now the vxlan device flannel.1 knows the outer destination IP of the packet, and the encapsulated packet leaves through eth1 according to the host routing table. The host routes are:

[root@node1 ~]# ip route
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
10.230.41.0/24 dev cni0 proto kernel scope link src 10.230.41.1
10.230.93.0/24 via 10.230.93.0 dev flannel.1 onlink
10.240.0.0/24 dev eth1 proto kernel scope link src 10.240.0.101
169.254.0.0/16 dev eth0 scope link metric 1002
169.254.0.0/16 dev eth1 scope link metric 1003
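
The VXLAN encapsulation can be observed on the wire by capturing UDP traffic on eth1; as the ip -d link output above shows, flannel's vxlan backend uses UDP port 8472:

tcpdump -i eth1 -nn udp port 8472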

The packet arrives at eth1 on node2 as a VXLAN packet. Its inner destination MAC address, 2a:02:24:58:e9:07, is exactly the MAC address of flannel.1 on node2, so the packet is handed to the flannel.1 device:

[root@node2 ~]# ip addr show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 2a:02:24:58:e9:07 brd ff:ff:ff:ff:ff:ff
inet 10.230.93.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::2802:24ff:fe58:e907/64 scope link
valid_lft forever preferred_lft forever

After flannel.1 decapsulates the packet, the inner destination address 10.230.93.2 is looked up in the routing table and the packet is forwarded to cni0:

[root@node2 ~]# ip route
default via 10.0.2.2 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.230.41.0/24 via 10.230.41.0 dev flannel.1 onlink
10.230.93.0/24 dev cni0 proto kernel scope link src 10.230.93.1
10.240.0.0/24 dev eth1 proto kernel scope link src 10.240.0.102 metric 101

cni0 then resolves the MAC address of 10.230.93.2 with an ARP request and forwards the packet to the corresponding pod's veth pair device, and thus into the container.
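
The neighbour entry node2 learned for the pod can be inspected with, for example:

ip neigh show dev cni0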

The return path is symmetric, so it is not described again.

Next, a brief walk through the flanneld source code.

The main function first calls newSubnetManager to create the SubnetManager:

sm, err := newSubnetManager()
if err != nil {
    log.Error("Failed to create SubnetManager: ", err)
    os.Exit(1)
}
log.Infof("Created subnet manager: %s", sm.Name())

The SubnetManager leases or renews a subnet from the network configuration store. Every node gets a subnet of its own, which guarantees that container IPs on different nodes do not conflict.

func newSubnetManager() (subnet.Manager, error) {
    if opts.kubeSubnetMgr {
        return kube.NewSubnetManager(opts.kubeApiUrl, opts.kubeConfigFile)
    }

    cfg := &etcdv2.EtcdConfig{
        Endpoints: strings.Split(opts.etcdEndpoints, ","),
        Keyfile:   opts.etcdKeyfile,
        Certfile:  opts.etcdCertfile,
        CAFile:    opts.etcdCAFile,
        Prefix:    opts.etcdPrefix,
        Username:  opts.etcdUsername,
        Password:  opts.etcdPassword,
    }

    // Attempt to renew the lease for the subnet specified in the subnetFile
    prevSubnet := ReadSubnetFromSubnetFile(opts.subnetFile)

    return etcdv2.NewLocalManager(cfg, prevSubnet)
}

If the kube-subnet-mgr command-line flag is specified, the Kubernetes API is used as the global network configuration store; otherwise etcd is used.
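
For illustration, the two modes are selected roughly as follows (a sketch; the etcd endpoint is a placeholder and the exact flag set depends on the flanneld version):

# etcd as the configuration store
flanneld --etcd-endpoints=https://10.240.0.10:2379 --iface=eth1
# Kubernetes API as the configuration store
flanneld --kube-subnet-mgr --iface=eth1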

getConfig is then called to fetch the network configuration from the global store, including the cluster network and the backend configuration:

// Fetch the network config (i.e. what backend to use etc..).
config, err := getConfig(ctx, sm)
if err == errCanceled {
    wg.Wait()
    os.Exit(0)
}

For example, the configuration written to etcd in my test environment is:

{"Network":"10.230.0.0/16","SubnetLen":24, "Backend":{"Type": "vxlan"}}

Next, main calls backend.NewManager:

// Create a backend manager then use it to create the backend and register the network with it.
bm := backend.NewManager(ctx, sm, extIface)
be, err := bm.GetBackend(config.BackendType)
if err != nil {
    log.Errorf("Error fetching backend: %s", err)
    cancel()
    wg.Wait()
    os.Exit(1)
}

bn, err := be.RegisterNetwork(ctx, config)
if err != nil {
    log.Errorf("Error registering network: %s", err)
    cancel()
    wg.Wait()
    os.Exit(1)
}

As mentioned at the beginning, flannel supports different cross-host communication mechanisms through its backend abstraction. Each backend implementation registers its constructor with the backend package in an init function. For example, the init function of package vxlan:

func init() {
    backend.Register("vxlan", New)
}

const (
    defaultVNI = 1
)

type VXLANBackend struct {
    subnetMgr subnet.Manager
    extIface  *backend.ExternalInterface
}

func New(sm subnet.Manager, extIface *backend.ExternalInterface) (backend.Backend, error) {
    backend := &VXLANBackend{
        subnetMgr: sm,
        extIface:  extIface,
    }

    return backend, nil
}

be.RegisterNetwork ends up calling RegisterNetwork in package vxlan:

func (be *VXLANBackend) RegisterNetwork(ctx context.Context, config *subnet.Config) (backend.Network, error) {
    // Parse our configuration
    cfg := struct {
        VNI           int
        Port          int
        GBP           bool
        DirectRouting bool
    }{
        VNI: defaultVNI,
    }

    if len(config.Backend) > 0 {
        if err := json.Unmarshal(config.Backend, &cfg); err != nil {
            return nil, fmt.Errorf("error decoding VXLAN backend config: %v", err)
        }
    }
    log.Infof("VXLAN config: VNI=%d Port=%d GBP=%v DirectRouting=%v", cfg.VNI, cfg.Port, cfg.GBP, cfg.DirectRouting)

    devAttrs := vxlanDeviceAttrs{
        vni:       uint32(cfg.VNI),
        name:      fmt.Sprintf("flannel.%v", cfg.VNI),
        vtepIndex: be.extIface.Iface.Index,
        vtepAddr:  be.extIface.IfaceAddr,
        vtepPort:  cfg.Port,
        gbp:       cfg.GBP,
    }

    dev, err := newVXLANDevice(&devAttrs)
    if err != nil {
        return nil, err
    }
    dev.directRouting = cfg.DirectRouting

    subnetAttrs, err := newSubnetAttrs(be.extIface.ExtAddr, dev.MACAddr())
    if err != nil {
        return nil, err
    }

    lease, err := be.subnetMgr.AcquireLease(ctx, subnetAttrs)
    switch err {
    case nil:
    case context.Canceled, context.DeadlineExceeded:
        return nil, err
    default:
        return nil, fmt.Errorf("failed to acquire lease: %v", err)
    }

    // Ensure that the device has a /32 address so that no broadcast routes are created.
    // This IP is just used as a source address for host to workload traffic (so
    // the return path for the traffic has an address on the flannel network to use as the destination)
    if err := dev.Configure(ip.IP4Net{IP: lease.Subnet.IP, PrefixLen: 32}); err != nil {
        return nil, fmt.Errorf("failed to configure interface %s: %s", dev.link.Attrs().Name, err)
    }

    return newNetwork(be.subnetMgr, be.extIface, dev, ip.IP4Net{}, lease)
}

RegisterNetwork calls newVXLANDevice to create the vxlan device, which corresponds to flannel.1 in our experiment environment. As the code shows, the 1 in the device name flannel.1 is the VNI, which can be changed in the global configuration store. It then fills a subnetAttrs structure with the local VTEP's IP address (the external interface address) and the vxlan device's MAC address and calls be.subnetMgr.AcquireLease. This eventually reaches tryAcquireLease in package etcdv2, which calls m.registry.createSubnet or m.registry.updateSubnet to write the subnet information into etcd and complete the lease. At this point, any flanneld on another node that is already watching the subnets key in etcd is triggered to add the route, ARP and FDB entries for this subnet; the details are described below. Finally, dev.Configure assigns the vxlan device an address with a /32 prefix so that no broadcast route is created.
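
For example (a sketch using the fields of the cfg struct above; the values are illustrative), a backend configuration like this would make flanneld create a device named flannel.2 instead:

{"Network":"10.230.0.0/16","SubnetLen":24,"Backend":{"Type":"vxlan","VNI":2,"DirectRouting":false}}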

After RegisterNetwork returns, main calls WriteSubnetFile to write the acquired network information to subnetFile, /run/flannel/subnet.env by default; on subsequent starts flanneld first tries to renew the subnet recorded in this file:

if err := WriteSubnetFile(opts.subnetFile, config.Network, opts.ipMasq, bn); err != nil {
    // Continue, even though it failed.
    log.Warningf("Failed to write subnet file: %s", err)
} else {
    log.Infof("Wrote subnet file to %s", opts.subnetFile)
}

main then starts a goroutine to run bn.Run:

// Start "Running" the backend network. This will block until the context is done so run in another goroutine.
log.Info("Running backend.")
wg.Add(1)
go func() {
bn.Run(ctx)
wg.Done()
}()

This invokes the Run implementation in package vxlan, which calls subnet.WatchLeases to track subnet leases across the whole cluster:

func (nw *network) Run(ctx context.Context) {
    wg := sync.WaitGroup{}

    log.V(0).Info("watching for new subnet leases")
    events := make(chan []subnet.Event)
    wg.Add(1)
    go func() {
        subnet.WatchLeases(ctx, nw.subnetMgr, nw.SubnetLease, events)
        log.V(1).Info("WatchLeases exited")
        wg.Done()
    }()

    defer wg.Wait()

    for {
        select {
        case evtBatch := <-events:
            nw.handleSubnetEvents(evtBatch)

        case <-ctx.Done():
            return
        }
    }
}

The WatchLeases function in package subnet keeps calling sm.WatchLeases in a loop. On the first call, sm.WatchLeases returns the subnet information already present in etcd; after that it watches the subnets key in etcd for changes. The resulting subnet events are sent to the receiver channel:
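
The subnet keys that flanneld watches can also be inspected directly (again assuming the default etcd prefix /coreos.com/network and the etcd v2 API):

etcdctl ls /coreos.com/network/subnets
etcdctl watch --recursive /coreos.com/network/subnets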

func WatchLeases(ctx context.Context, sm Manager, ownLease *Lease, receiver chan []Event) {
    lw := &leaseWatcher{
        ownLease: ownLease,
    }
    var cursor interface{}

    for {
        res, err := sm.WatchLeases(ctx, cursor)
        if err != nil {
            if err == context.Canceled || err == context.DeadlineExceeded {
                return
            }

            log.Errorf("Watch subnets: %v", err)
            time.Sleep(time.Second)
            continue
        }

        cursor = res.Cursor

        var batch []Event

        if len(res.Events) > 0 {
            batch = lw.update(res.Events)
        } else {
            batch = lw.reset(res.Snapshot)
        }

        if len(batch) > 0 {
            receiver <- batch
        }
    }
}

The goroutine reading from receiver then calls nw.handleSubnetEvents(evtBatch) to process these events:

func (nw *network) handleSubnetEvents(batch []subnet.Event) {
    for _, event := range batch {
        sn := event.Lease.Subnet
        attrs := event.Lease.Attrs
        if attrs.BackendType != "vxlan" {
            log.Warningf("ignoring non-vxlan subnet(%s): type=%v", sn, attrs.BackendType)
            continue
        }

        var vxlanAttrs vxlanLeaseAttrs
        if err := json.Unmarshal(attrs.BackendData, &vxlanAttrs); err != nil {
            log.Error("error decoding subnet lease JSON: ", err)
            continue
        }

        // This route is used when traffic should be vxlan encapsulated
        vxlanRoute := netlink.Route{
            LinkIndex: nw.dev.link.Attrs().Index,
            Scope:     netlink.SCOPE_UNIVERSE,
            Dst:       sn.ToIPNet(),
            Gw:        sn.IP.ToIP(),
        }
        vxlanRoute.SetFlag(syscall.RTNH_F_ONLINK)

        // directRouting is where the remote host is on the same subnet so vxlan isn't required.
        directRoute := netlink.Route{
            Dst: sn.ToIPNet(),
            Gw:  attrs.PublicIP.ToIP(),
        }
        var directRoutingOK = false
        if nw.dev.directRouting {
            routes, err := netlink.RouteGet(attrs.PublicIP.ToIP())
            if err != nil {
                log.Errorf("Couldn't lookup route to %v: %v", attrs.PublicIP, err)
                continue
            }
            if len(routes) == 1 && routes[0].Gw == nil {
                // There is only a single route and there's no gateway (i.e. it's directly connected)
                directRoutingOK = true
            }
        }

        switch event.Type {
        case subnet.EventAdded:
            if directRoutingOK {
                log.V(2).Infof("Adding direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)

                if err := netlink.RouteReplace(&directRoute); err != nil {
                    log.Errorf("Error adding route to %v via %v: %v", sn, attrs.PublicIP, err)
                    continue
                }
            } else {
                log.V(2).Infof("adding subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))
                if err := nw.dev.AddARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                    log.Error("AddARP failed: ", err)
                    continue
                }

                if err := nw.dev.AddFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                    log.Error("AddFDB failed: ", err)

                    // Try to clean up the ARP entry then continue
                    if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                        log.Error("DelARP failed: ", err)
                    }

                    continue
                }

                // Set the route - the kernel would ARP for the Gw IP address if it hadn't already been set above so make sure
                // this is done last.
                if err := netlink.RouteReplace(&vxlanRoute); err != nil {
                    log.Errorf("failed to add vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)

                    // Try to clean up both the ARP and FDB entries then continue
                    if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                        log.Error("DelARP failed: ", err)
                    }

                    if err := nw.dev.DelFDB(neighbor{IP: event.Lease.Attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                        log.Error("DelFDB failed: ", err)
                    }

                    continue
                }
            }
        case subnet.EventRemoved:
            if directRoutingOK {
                log.V(2).Infof("Removing direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)
                if err := netlink.RouteDel(&directRoute); err != nil {
                    log.Errorf("Error deleting route to %v via %v: %v", sn, attrs.PublicIP, err)
                }
            } else {
                log.V(2).Infof("removing subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))

                // Try to remove all entries - don't bail out if one of them fails.
                if err := nw.dev.DelARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                    log.Error("DelARP failed: ", err)
                }

                if err := nw.dev.DelFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                    log.Error("DelFDB failed: ", err)
                }

                if err := netlink.RouteDel(&vxlanRoute); err != nil {
                    log.Errorf("failed to delete vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)
                }
            }
        default:
            log.Error("internal error: unknown event type: ", int(event.Type))
        }
    }
}

Here we ignore the directRouting path. EventAdded means a new node has come online. The handler first calls nw.dev.AddARP to add an ARP entry on the vxlan device, with the MAC address of the new node's vxlan device and its /32 address configured above. It then calls nw.dev.AddFDB to add an FDB entry on the vxlan device, mapping that MAC address to the new node's VTEP IP address. Finally it calls netlink.RouteReplace(&vxlanRoute) to add the route that reaches the new subnet via the /32 address. As the comment in the code explains, the route is added last so that the kernel never issues an ARP request for a gateway whose ARP entry has not been installed yet.
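
The effect of these netlink calls can be watched live on a node while another node joins or leaves the cluster, for example with ip monitor, which prints routing and neighbour table changes as flanneld applies them:

ip monitor all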

EventRemoved means a node has gone offline; the handler calls nw.dev.DelARP, nw.dev.DelFDB and netlink.RouteDel to delete the corresponding ARP, FDB and route entries.

After that, the main function also calls MonitorLease to renew the subnet lease periodically, which is not covered further here.

References: