Hence, this post. I'm going to jump right into the configuration of VXLAN, but I'm going to foolishly assume you have a working unicast routing topology before we start. I'll be using single-area IS-IS routing, but you can use literally whatever routing protocol you want. You just need full reachability between all loopback interfaces. Here's my topology:
The interfaces facing southbound from the CSRs to the servers are unaddressed, and they're going to remain that way. To give a little insight, here's the configuration on CSR1, along with the output of show ip route isis.
router isis 1
 net 00.0000.0000.0011.00
 is-type level-1
 log-adjacency-changes
 passive-interface Loopback0
!
interface Loopback0
 description Loopback
 ip address 10.11.11.11 255.255.255.255
!
interface GigabitEthernet2
 description to SPINE1
 ip address 10.1.11.11 255.255.255.0
 ip router isis 1
!
interface GigabitEthernet3
 description to SPINE2
 ip address 10.2.12.11 255.255.255.0
 ip router isis 1
!
interface GigabitEthernet4
 description to SPINE3
 ip address 10.3.13.11 255.255.255.0
 ip router isis 1
!
interface GigabitEthernet5
 description to POD1-SW
 no ip address
!
!
!
i L1 10.1.1.1/32 [115/10] via 10.1.11.1, 00:00:18, GigabitEthernet2
i L1 10.1.21.0/24 [115/20] via 10.1.11.1, 00:00:18, GigabitEthernet2
i L1 10.1.31.0/24 [115/20] via 10.1.11.1, 00:00:18, GigabitEthernet2
i L1 10.1.41.0/24 [115/20] via 10.1.11.1, 00:00:18, GigabitEthernet2
i L1 10.2.2.2/32 [115/10] via 10.2.12.2, 00:00:40, GigabitEthernet3
i L1 10.2.22.0/24 [115/20] via 10.2.12.2, 00:00:40, GigabitEthernet3
i L1 10.2.32.0/24 [115/20] via 10.2.12.2, 00:00:40, GigabitEthernet3
i L1 10.2.42.0/24 [115/20] via 10.2.12.2, 00:00:40, GigabitEthernet3
i L1 10.3.3.3/32 [115/10] via 10.3.13.3, 13:27:57, GigabitEthernet4
i L1 10.3.23.0/24 [115/20] via 10.3.13.3, 13:27:57, GigabitEthernet4
i L1 10.3.33.0/24 [115/20] via 10.3.13.3, 13:27:57, GigabitEthernet4
i L1 10.3.43.0/24 [115/20] via 10.3.13.3, 13:27:57, GigabitEthernet4
i L1 10.12.12.12/32 [115/20] via 10.3.13.3, 00:00:18, GigabitEthernet4
                    [115/20] via 10.2.12.2, 00:00:18, GigabitEthernet3
                    [115/20] via 10.1.11.1, 00:00:18, GigabitEthernet2
i L1 10.13.13.13/32 [115/20] via 10.3.13.3, 00:00:18, GigabitEthernet4
                    [115/20] via 10.2.12.2, 00:00:18, GigabitEthernet3
                    [115/20] via 10.1.11.1, 00:00:18, GigabitEthernet2
i L1 10.14.14.14/32 [115/20] via 10.3.13.3, 00:00:18, GigabitEthernet4
                    [115/20] via 10.2.12.2, 00:00:18, GigabitEthernet3
                    [115/20] via 10.1.11.1, 00:00:18, GigabitEthernet2
Alright, so we have unicast routing in place. Before we start working on our VXLAN configuration, we're also going to need functional multicast routing. You can use static RP, Auto-RP, or BSR; however, you have to enable bidirectional PIM. Why bidirectional PIM, you ask? Well, the short answer is that bidir PIM was created to address a shortcoming of traditional multicast. Traditional multicast operates on the idea that there are many, many more receivers than sources. In VXLAN, however, our VTEPs act as both receivers and sources (a little more on this at the end). In this example I'll be using BSR to announce Spine1 as my RP. For brevity, I'll provide the configuration of only Spine1 and CSR1; if you're testing this out in your lab, though, you'll want PIM sparse mode enabled on all your interfaces except the ones facing your clients/servers. You'll also want bidirectional PIM enabled on all devices.
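A quick aside before the configs: to put a rough number on that shortcoming, here's a back-of-the-envelope sketch in Python (nothing you'd run on a router). It assumes every VTEP ends up sending to every VXLAN group, that plain sparse mode would build one (S,G) entry per source on top of the shared (*,G) entry, and that bidir keeps only the (*,G) entry. The function name is made up for illustration.

```python
def mroute_entries(num_vteps: int, num_groups: int, bidir: bool) -> int:
    """Rough count of multicast route entries on a router that sees all the traffic."""
    shared_tree = num_groups                 # one (*,G) entry per group
    if bidir:
        return shared_tree                   # bidir PIM does not build (S,G) state
    source_trees = num_vteps * num_groups    # assumed: one (S,G) per sending VTEP per group
    return shared_tree + source_trees


if __name__ == "__main__":
    # This lab: 4 VTEPs (CSR1 through CSR4) and a single group.
    print("sparse-mode entries:", mroute_entries(4, 1, bidir=False))
    print("bidir entries:      ", mroute_entries(4, 1, bidir=True))
```

With four VTEPs and one group the difference is tiny, but the sparse-mode count grows with VTEPs times groups while the bidir count only grows with groups.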
SPINE1#show run | i ip pim|interface
interface Loopback0
 ip pim sparse-mode
interface GigabitEthernet0/1
 ip pim sparse-mode
interface GigabitEthernet0/2
 ip pim sparse-mode
interface GigabitEthernet0/3
 ip pim sparse-mode
interface GigabitEthernet0/4
 ip pim sparse-mode
ip multicast-routing
ip pim bidir-enable
ip pim bsr-candidate Loopback0 0
ip pim rp-candidate Loopback0 group-list GROUP1-MCAST bidir <-- NOTICE we're announcing this RP as a bidir RP
!
!
SPINE1#show ip access-list GROUP1-MCAST
Standard IP access list GROUP1-MCAST
10 permit 239.0.0.0, wildcard bits 0.255.255.255
##
CSR1#show run | i interface|ip pim
interface Loopback0
 ip pim sparse-mode
interface GigabitEthernet2
 ip pim sparse-mode
interface GigabitEthernet3
 ip pim sparse-mode
interface GigabitEthernet4
 ip pim sparse-mode
interface GigabitEthernet5
##This interface connects down to client, it's L2 only, hence no PIM configuration.
ip pim bidir-enable
ip multicast-routing distributed
So a couple of key points there: we're enabling bidirectional PIM globally with "ip pim bidir-enable" on all devices. Then on the RP, I'm telling Spine1 to announce itself as not only an RP candidate, but an RP that supports bidirectional PIM. I'm also filtering the groups this RP is responsible for with the ACL "GROUP1-MCAST"... because I felt like it lol. The next thing I like to do is verify multicast routing is working as expected. So, I'll go to CSR4, have it join an mcast group, and do a simple ping test from CSR1.
CSR4(config)#int lo0
CSR4(config-if)# ip igmp join-group 239.0.0.4
!
!
### From CSR1 ###
CSR1#ping 239.0.0.4 time 1 rep 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.4, timeout is 1 seconds:
Reply to request 0 from 10.14.14.14, 17 ms
Reply to request 0 from 10.14.14.14, 17 ms
Reply to request 0 from 10.14.14.14, 17 ms
Reply to request 0 from 10.14.14.14, 17 ms
Reply to request 1 from 10.14.14.14, 25 ms
Reply to request 1 from 10.14.14.14, 25 ms
Reply to request 1 from 10.14.14.14, 25 ms
Reply to request 1 from 10.14.14.14, 25 ms
Reply to request 2 from 10.14.14.14, 27 ms
Reply to request 2 from 10.14.14.14, 31 ms
Reply to request 2 from 10.14.14.14, 31 ms
Reply to request 2 from 10.14.14.14, 31 ms
Perfect! Just to keep things clean, I removed that join from CSR4. Now on to the actual VXLAN configuration. With functional unicast and multicast routing, we only need a few extra ingredients to make this work.
1. Network Virtualization Endpoint (NVE) Interface
2. Service Instance
3. Bridge-Domain (to tie it all together).
Luckily, the configuration is so generic, we can just copy and paste it to all our VXLAN Tunnel Endpoints (VTEPs). I know... I know I'm defining every acronym lol. I don't care, I like knowing what all the acronyms actually stand for. Alright! So here's a very basic configuration I'll drop on my VTEPs to create a single bridge-domain with a single VXLAN Network Identifier (VNI). This is the equivalent of configuring a single VLAN on your network. It's just way cooler, since connections between VTEPs are all layer 3 and hence get the benefit of ECMP.
int nve 1
 source-interface lo0
 member vni 47884 mcast-group 239.0.12.34
!
interface GigabitEthernet5
 service instance 1 ethernet
  encapsulation untagged
 exit
!
bridge-domain 1
 member vni 47884
 member GigabitEthernet5 service-instance 1
So that bit of configuration is basically saying "Any untagged traffic received on Gig5 is part of service instance 1. Service instance 1 is part of bridge-domain 1, as is VNI 47884. VNI 47884 is using multicast group 239.0.12.34." Now, after applying this configuration to all our VTEPs (the CSR1Kvs for this lab), we can do a couple of test pings from our servers.
cisco@server-1:~$ ip add sh eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether fa:16:3e:08:a7:4a brd ff:ff:ff:ff:ff:ff
inet 192.168.0.1/24 brd 192.168.0.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe08:a74a/64 scope link
valid_lft forever preferred_lft forever
cisco@server-1:~$ ping 192.168.0.2 -c 3
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=8.61 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=6.29 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=5.36 ms
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 5.366/6.757/8.616/1.368 ms
cisco@server-1:~$ ping 192.168.0.3 -c 3
PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=5.38 ms
64 bytes from 192.168.0.3: icmp_seq=2 ttl=64 time=5.21 ms
64 bytes from 192.168.0.3: icmp_seq=3 ttl=64 time=4.34 ms
--- 192.168.0.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 4.340/4.981/5.389/0.462 ms
cisco@server-1:~$ ping 192.168.0.4 -c 3
PING 192.168.0.4 (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=5.83 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=5.44 ms
64 bytes from 192.168.0.4: icmp_seq=3 ttl=64 time=7.29 ms
--- 192.168.0.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 5.441/6.191/7.296/0.797 ms
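As a side note, the chain that the VXLAN configuration sets up (port and encapsulation, then service instance, then bridge-domain, then VNI, then multicast group) can be sketched as a few lookups. This is only a toy model in Python using the values from this post; the dictionary and function names are my own invention and have nothing to do with how IOS XE stores any of this. It also checks that the VNI's group falls inside the range permitted by GROUP1-MCAST, since Spine1 only announces itself as RP for groups that ACL permits.

```python
import ipaddress

# Values copied from the configuration in this post.
SERVICE_INSTANCES = {("GigabitEthernet5", "untagged"): 1}    # port + encapsulation -> service instance
BRIDGE_DOMAINS = {1: {"service_instance": 1, "vni": 47884}}  # bridge-domain -> its members
VNI_GROUPS = {47884: "239.0.12.34"}                          # member vni ... mcast-group ...


def classify(port: str, encap: str) -> dict:
    """Follow port -> service instance -> bridge-domain -> VNI -> multicast group."""
    instance = SERVICE_INSTANCES[(port, encap)]
    for bd_id, members in BRIDGE_DOMAINS.items():
        if members["service_instance"] == instance:
            vni = members["vni"]
            return {
                "service_instance": instance,
                "bridge_domain": bd_id,
                "vni": vni,
                "group": VNI_GROUPS[vni],
            }
    raise LookupError(f"service instance {instance} is not in any bridge-domain")


def acl_permits(address: str, base: str, wildcard: str) -> bool:
    """Standard ACL match: bits set in the wildcard mask are 'don't care' bits."""
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    addr_bits = int(ipaddress.IPv4Address(address)) & care
    base_bits = int(ipaddress.IPv4Address(base)) & care
    return addr_bits == base_bits


if __name__ == "__main__":
    result = classify("GigabitEthernet5", "untagged")
    print(result)
    # GROUP1-MCAST: permit 239.0.0.0, wildcard bits 0.255.255.255
    print("group covered by GROUP1-MCAST:",
          acl_permits(result["group"], "239.0.0.0", "0.255.255.255"))
```

If you pick a group for your VNI that the RP's group list doesn't cover, that wildcard check is a good first thing to look at.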
Also, before we continue our discussion and I finally tell you WHY we have to use multicast, let's take a look at bridge-domain 1 on CSR1 and see what it looks like.
CSR1#show bridge-domain 1
Bridge-domain 1 (2 ports in all)
State: UP Mac learning: Enabled
Aging-Timer: 300 second(s)
GigabitEthernet5 service instance 1
vni 47884
AED MAC address Policy Tag Age Pseudoport
0 FA16.3E0A.1FD8 forward dynamic 214 nve1.VNI47884, VxLAN
src: 10.11.11.11 dst: 10.12.12.12
0 FA16.3E84.952A forward dynamic 219 nve1.VNI47884, VxLAN
src: 10.11.11.11 dst: 10.13.13.13
0 FA16.3EA9.7EDF forward dynamic 223 nve1.VNI47884, VxLAN
src: 10.11.11.11 dst: 10.14.14.14
0 FA16.3E08.A74A forward dynamic 223 GigabitEthernet5.EFP1
So bridge-domain 1 is storing MAC addresses for local and remote hosts. Notice we're mapping the MAC addresses of servers 2 through 4 to not just a VNI; we also store the other VTEPs' loopbacks as part of that mapping (look at src: 10.11.11.11 dst: 10.xx.xx.xx). How did we learn that information???
Well, to get a better view of this, I shut down interfaces Gi3 and Gi4 on CSR1, forcing all traffic through Gi2 (connected to Spine1). I also cleared the MAC address table on the bridge-domain (clear bridge-domain 1 mac table) and cleared the ARP entries from the host. Then I set up a packet capture on Gi2 and re-ran my ping from server 1 (192.168.0.1) to server 2 (192.168.0.2). Check out this capture and see if you can wrap your mind around it before I attempt to explain.
So, if you're reading carefully, you'll notice that the ARP request is actually being sent to the multicast address of 239.0.12.34, which is exactly the reason we're using multicast to support VXLAN. Multicast is used for all unknown unicast and broadcast traffic. In the case of ARP, the VTEP will actually learn which other VTEP has that host connected, in the exact same fashion a switch learns MAC addresses. From then on, when server 1 sends frames to server 2, communication is unicast between CSR1 and CSR2. **A quick additional note: You can also see that VXLAN is using UDP encapsulation, just in case you ever hear it referred to as MAC-in-UDP.**
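To tie the capture together, here's a rough Python model of that flood-and-learn behavior, plus the 8-byte VXLAN header that sits between the outer UDP header and the inner Ethernet frame. The header layout (a flags byte with the "VNI present" bit set, a 24-bit VNI, and reserved bits) comes from RFC 7348; the Vtep class and its method names are purely illustrative and not how the CSR implements anything. I'm also assuming FA16.3E0A.1FD8 is server 2, since that's the MAC the bridge-domain output maps to 10.12.12.12.

```python
import struct

MAX_VNI = 2**24 - 1  # the VNI field is 24 bits wide


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags, 24 reserved bits, 24-bit VNI, 8 reserved bits."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24   # 'I' flag set: the VNI field is valid
    vni_word = vni << 8       # VNI occupies the top 24 bits of the second word
    return struct.pack("!II", flags_word, vni_word)


class Vtep:
    """Toy VTEP: maps remote MAC addresses to the loopback of the VTEP behind which they live."""

    def __init__(self, loopback: str, group: str):
        self.loopback = loopback
        self.group = group
        self.mac_table = {}   # inner MAC -> remote VTEP loopback

    def outer_destination(self, dst_mac: str) -> str:
        """Known MAC goes unicast to its VTEP; broadcast or unknown goes to the multicast group."""
        return self.mac_table.get(dst_mac, self.group)

    def learn(self, inner_src_mac: str, outer_src_ip: str) -> None:
        """On decapsulation, remember which VTEP the inner source MAC came from."""
        self.mac_table[inner_src_mac] = outer_src_ip


if __name__ == "__main__":
    server1, server2 = "FA16.3E08.A74A", "FA16.3E0A.1FD8"
    csr1 = Vtep("10.11.11.11", "239.0.12.34")
    csr2 = Vtep("10.12.12.12", "239.0.12.34")

    # 1. Server 1 ARPs for server 2: a broadcast, so CSR1 sends it to the group.
    print("ARP request outer dst:", csr1.outer_destination("FFFF.FFFF.FFFF"))
    csr2.learn(server1, csr1.loopback)

    # 2. Server 2 replies to server 1's MAC, which CSR2 has just learned.
    print("ARP reply outer dst:  ", csr2.outer_destination(server1))
    csr1.learn(server2, csr2.loopback)

    # 3. From here on, server 1 to server 2 is unicast between the VTEPs.
    print("Ping outer dst:       ", csr1.outer_destination(server2))
    print("VXLAN header for VNI 47884:", vxlan_header(47884).hex())
```

The 24-bit VNI field is also where the segment count comes from: 2^24 is 16,777,216 possible VNIs.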
So in summary, VXLAN gives us over 16 million segments (compared to 4094 if you used VLANs only) and all the benefits of a routed network (much better scalability and support for ECMP), and it carries traditional Ethernet concepts over with it. I'll endeavor to do a follow-up post in the future to look at running multiple VNIs. Until then, happy networking, you packet pushers you.