Tore Anderson's technology blog

IPv6 network boot with UEFI and iPXE

Here at Redpill Linpro we make extensive use of network booting to provision software onto our servers. Many of our servers don’t even have local storage - they boot from the network every time they start up. Others use network boot in order to install an operating system to local storage. The days when we were running around in our data centres with USB or optical install media are long gone, and we’re definitely not looking back.

Our network boot infrastructure is currently built around iPXE, a very flexible network boot firmware with powerful scripting functionality. Our virtual servers (using QEMU/KVM) simply execute iPXE directly. Our physical servers, on the other hand, use their standard built-in PXE ROMs in order to chainload an iPXE UNDI ROM over the network.

IPv6 PXE was first included in UEFI version 2.3 (Errata D), published five years ago. However, not all servers support IPv6 PXE yet, including the ageing ones in my lab. I’ll therefore focus on virtual servers for now, and will get back to IPv6 PXE on physical servers later.

Enabling IPv6 support in iPXE

At the time of writing, iPXE does not enable IPv6 support by default. This default spills over into Linux distributions like Fedora. I’m trying to get this changed, but for now it is necessary to manually rebuild iPXE with IPv6 support enabled.

This is done by downloading the iPXE sources and then enabling NET_PROTO_IPV6 in src/config/general.h. Replace #undef with #define so that the full line reads #define NET_PROTO_IPV6.
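A minimal sketch of the procedure, assuming a checkout from the iPXE project’s GitHub repository, could look like this:

git clone https://github.com/ipxe/ipxe.git
cd ipxe/src
# Turn the "#undef NET_PROTO_IPV6" line into "#define NET_PROTO_IPV6":
sed -i 's/#undef\([ \t]*NET_PROTO_IPV6\)/#define\1/' config/general.h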

At this point, we’re ready to build iPXE. For the virtio-net driver used by our QEMU/KVM hypervisors, the correct command is make -C /path/to/ipxe/src bin/1af41000.rom. To build a UEFI image suitable for chainloading, run make -C /path/to/ipxe/src bin-x86_64-efi/ipxe.efi instead.

On RHEL7-based hypervisors, upgrading iPXE is just a matter of replacing the default 1af41000.rom file in /usr/share/ipxe with the one that was just built.
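In other words, something along these lines (keeping a backup of the distribution ROM, just in case):

cp /usr/share/ipxe/1af41000.rom /usr/share/ipxe/1af41000.rom.orig
cp /path/to/ipxe/src/bin/1af41000.rom /usr/share/ipxe/1af41000.rom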

Network configuration

The network must be set up with both ICMPv6 Router Advertisements (RAs) and DHCPv6. RAs are necessary in order to provision the booting nodes with a default IPv6 router, while DHCPv6 is the only way to advertise IPv6 network boot options.

When it comes to the assignment of IPv6 addresses, you can use either SLAAC or DHCPv6 IA_NA. iPXE supports both approaches. Avoid using both at the same time, though, as doing so may trigger a bug which could lead to the boot process getting stuck halfway through.

You’ll probably want to provision the nodes with an IPv6 DNS server. This can be done with either DHCPv6 or ICMPv6 RAs, and iPXE supports both approaches, so either will do just fine. That said, I recommend enabling both at the same time, as it might very well be that some UEFI implementation only supports one of them.

ICMPv6 Router Advertisement configuration

protocol radv {
  # Use Google's public DNS server.
  rdnss {
    ns 2001:4860:4860::8888;
  };
  interface "vlan123" {
    managed no;       # Addresses (IA_NA) aren't found in DHCPv6
    other config yes; # "Other Configuration" is found in DHCPv6 
    prefix 2001:db8::/64 {
      onlink yes;     # The prefix is on-link
      autonomous yes; # The prefix may be used for SLAAC
    };
  };
}

The configuration above is for BIRD. It is all pretty standard stuff, but pay attention to the fact that the other config flag is enabled. This is required in order to make iPXE ask the DHCPv6 server for the Boot File URL Option.

DHCPv6 server configuration

option dhcp6.user-class code 15 = string;
option dhcp6.bootfile-url code 59 = string;
option dhcp6.client-arch-type code 61 = array of unsigned integer 16;

option dhcp6.name-servers 2001:4860:4860::8888;

if exists dhcp6.client-arch-type and
   option dhcp6.client-arch-type = 00:07 {
    option dhcp6.bootfile-url "tftp://[2001:db8::69]/ipxe.efi";
} else if exists dhcp6.user-class and
          substring(option dhcp6.user-class, 2, 4) = "iPXE" {
    option dhcp6.bootfile-url "http://boot.ipxe.org/demo/boot.php";
}

subnet6 2001:db8::/64 {}

The config above is for the ISC DHCPv6 server. The first paragraph declares the various necessary DHCPv6 options and their syntax. For some reason, ISC dhcpd does not appear to have any intrinsic knowledge of these, even though they’re standardised.

The second paragraph ensures the server can advertise an IPv6 DNS server to clients. In this example I’m using Google’s Public DNS; you’ll probably want to replace it with your own IPv6 DNS server.

The if/else statement ensures two things:

  1. If the client is a UEFI firmware performing IPv6 PXE, then we just chainload a UEFI-compatible iPXE image. (As I mentioned earlier, I haven’t been able to fully test this config due to lack of lab equipment supporting IPv6 PXE.)
  2. If the client is iPXE, then we give it an iPXE script to execute. In this example, I’m using the iPXE project’s demo service, which boots a very basic Linux system.

Finally, I declare the subnet prefix where the IPv6-only VMs live. Without this, the DHCPv6 server will not answer any requests coming from this network. Since I’m not using stateful address assignment (DHCPv6 IA_NA), I do not need to configure an IPv6 address pool.
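For reference, had I wanted stateful assignment after all, ISC dhcpd would need an address pool declared inside the subnet. A minimal sketch, with an arbitrarily chosen range:

subnet6 2001:db8::/64 {
  # Only needed for stateful (IA_NA) address assignment:
  range6 2001:db8::1000 2001:db8::1fff;
}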

Conclusion

Thanks to iPXE and UEFI, network boot can be made to work just as well over IPv6 as over IPv4. The only real remaining problem is that many server models still lack support for IPv6 PXE, but I am assuming this will become less of an issue over time as they upgrade their UEFI implementations to version 2.3 (Errata D) or newer.

In virtualised environments, nothing is missing. Apart from the somewhat annoying requirement to rebuild iPXE to enable IPv6 support, it Just Works. This is evident from the boot log below, which shows a successful boot of a QEMU/KVM virtual machine residing on an IPv6-only network.

[root@kvmhost ~]# virsh create /etc/libvirt/qemu/v6only --console
Domain v6only created from /etc/libvirt/qemu/v6only
Connected to domain v6only
Escape character is ^]

Google, Inc.
Serial Graphics Adapter 06/09/14
SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $ (mockbuild@) Mon Jun  9 21:33:48 UTC 2014
4 0

SeaBIOS (version seabios-1.7.5-8.el7)
Machine UUID ebe11d4a-11d4-4ae8-b249-390cdf7c79ec

iPXE (http://ipxe.org) 00:03.0 CA00 PCI2.10 PnP PMM+7FF979E0+7FEF79E0 CA00

Booting from Hard Disk...
Boot failed: not a bootable disk

Booting from ROM...
iPXE (PCI 00:03.0) starting execution...ok
iPXE initialising devices...ok

iPXE 1.0.0+ (f92f) -- Open Source Network Boot Firmware -- http://ipxe.org
Features: DNS HTTP iSCSI TFTP AoE ELF MBOOT PXE bzImage Menu PXEXT

net0: 00:16:3e:c2:16:b7 using virtio-net on PCI00:03.0 (open)
  [Link:up, TX:0 TXE:0 RX:0 RXE:0]
Configuring (net0 00:16:3e:c2:16:b7).................. ok
net0: fe80::216:3eff:fec2:16b7/64
net0: 2001:db8::216:3eff:fec2:16b7/64 gw fe80::21e:68ff:fed9:d156
Filename: http://boot.ipxe.org/demo/boot.php
http://boot.ipxe.org/demo/boot.php.......... ok
boot.php : 127 bytes [script]
/vmlinuz-3.16.0-rc4... ok
/initrd.img... ok
Probing EDD (edd=off to disable)... ok

iPXE Boot Demonstration
=======================

Linux (none) 3.16.0-rc4+ #1 SMP Wed Jul 9 15:44:09 BST 2014 x86_64 unknown

Congratulations!  You have successfully booted the iPXE demonstration
image from http://boot.ipxe.org/demo/boot.php

See http://ipxe.org for more ideas on how to use iPXE.

root:/#

Evaluating DHCPv6 relays

One of the few remaining IPv4-only services here at Redpill Linpro is our provisioning infrastructure, which is based on PXE network booting. I’ve long wanted to do something about that. Now that more and more servers are shipping with UEFI support, I am finally in a position to start looking at it.

I’m starting out by figuring out which DHCPv6 relay implementation we’ll be using. This post details my evaluation process and the resulting choice.

Network topology

Most of the servers we want to provision are located in a dedicated customer VLAN that is connected to a set of redundant routers running Linux. The routers speak VRRP on each of the VLANs in order to decide which router is the primary one serving the VLAN in question.

Furthermore, each of the routers has multiple redundant uplinks to the core network, and uses a dynamic IGP to ensure optimal routing and fault tolerance. The DHCPv6 server is reached through the core network using unicast routing.

The following figure illustrates the topology:

Our desired capabilities

Our current IPv4-only network boot infrastructure relies heavily on using Ethernet MAC addresses to distinguish between clients. Being able to continue to do so will make the introduction of IPv6 support quick and easy. We would therefore like for the implementation to support the DHCPv6 Client Link-Layer Address Option.

As discussed in the previous section, the network configuration on the routers is dynamic and could change without notice. An ideal DHCPv6 relay implementation would be able to notice such changes and automatically adapt to the new environment. In our environment, this would mean being able to cope with:

  • A new VLAN interface showing up, e.g., when we provision a new customer.
  • The IP address configuration on a VLAN interface changing, e.g., due to a VRRP fail-over event.
  • The route to the DHCPv6 server changing from one uplink interface to another, e.g., due to changed route metrics in our core network.

Finally, regarding the software itself, we’d like for it to be:

  • Free and open-source software
  • Actively maintained
  • Available in Ubuntu’s software archive

Available implementations and their capabilities

From what I was able to determine, there are four available open-source DHCPv6 relay implementations. These are, in alphabetical order:

  • dhcpv6
  • Dibbler
  • ISC DHCP
  • WIDE-DHCPv6

The versions I tested are shown in parentheses.

Desired capability 1: The DHCPv6 Client Link-Layer Address Option

Of the tested implementations, only Dibbler supported this feature. It is enabled by adding the line option link-layer to the configuration file.

Desired capability 2: Detecting new interfaces on the fly

Disappointingly enough, none of the tested implementations were able to do this. dhcpv6 will by default listen on all available interfaces, but it does not detect new interfaces showing up after it has started.

The other three implementations all require that the listening interfaces be configured explicitly.

Desired capability 3: Detecting IPv6 addresses changing during runtime

Only WIDE-DHCPv6 was able to do this. It appears to check what the local address on the interface is every time it relays a packet, so it always sets the link address field in the relayed DHCPv6 packet correctly.

The other three implementations read in the global address (or lack thereof) for each interface when they start, and do not notice any changes. Thus, there is a risk that the link address field in their relayed packets is set incorrectly.

Desired capability 4: Coping with route to DHCPv6 server changing

Only dhcpv6 supports this without any weirdness. The address of the DHCPv6 server is specified with the -su command line option, and packets are relayed to it using a standard routing lookup.

ISC DHCP and WIDE-DHCPv6 behave in a rather bizarre way. They both require that the interface facing the DHCPv6 server is explicitly specified on the command line, but for some reason they completely ignore it and instead use a standard routing lookup to reach the server.

Dibbler also requires that the upstream interfaces are explicitly configured. If there is no route to the DHCPv6 server on one of these interfaces, it will log the following error for each DHCPv6 request:

Low-level layer error message: Unable to send data (dst addr: 2001:db8::d)
Failed to send data to server unicast address.

Fortunately, it is possible to simply configure both eth0 and eth1 as upstream interfaces. This will ensure all requests are correctly relayed regardless of which interface has the active route to the DHCPv6 server. That said, I’m only awarding half a point to Dibbler here, both due to the clunkiness of the workaround and the constant stream of error messages it will result in.

Desired capability 5: Free and open-source software

Yes! Every tested implementation qualifies.

Desired capability 6: Actively maintained

Only Dibbler and ISC DHCP appear to be. According to its own homepage, dhcpv6 was discontinued in 2009. WIDE-DHCPv6 has not seen any release since 2008.

Desired capability 7: Available in Ubuntu’s software archive

Only dhcpv6 is missing; the rest are an apt-get install away.

Conclusion

Out of a maximum 7 points, the final scores are as follows:

  1. Dibbler: 4.5 points
  2. ISC DHCP: 4 points
  3. WIDE-DHCPv6: 4 points
  4. dhcpv6: 2 points

Disappointingly enough, none of them are able to run continuously in a dynamic environment like ours. To work around this, we’ll probably have to devise a system that automatically generates new configuration and restarts the relay whenever a network configuration change is detected. Should a DHCPv6 request arrive exactly when the relay is being restarted, it will likely be retried within seconds, so this is extremely unlikely to cause any operational issues.
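A minimal sketch of such a system, using ip monitor to watch for network state changes (the generate-relay-config helper is hypothetical, and a real implementation would want smarter rate limiting than a fixed sleep):

#!/bin/sh
# Restart the DHCPv6 relay whenever links, addresses or routes change.
ip monitor link addr route | while read -r event; do
    /usr/local/sbin/generate-relay-config  # hypothetical config generator
    service dibbler-relay restart
    sleep 5                                # crude rate limiting
done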

This workaround will handle the lack of desired capabilities 2 through 4. After disregarding these, only Dibbler gets a full score (due to its support for the Client Link-Layer Address Option). Dibbler is thus the obvious choice.

SIIT-DC support in Varnish Cache through libvmod-rfc6052

Here at Redpill Linpro we’re big fans of the Varnish Cache. We tend to put Varnish in front of almost every web site that we operate for our customers, which goes a long way toward ensuring that they respond blazingly fast - even though the applications themselves might not always be designed with speed or scalability in mind.

We’re also big fans of IPv6, which we have deployed throughout our entire network infrastructure. We’ve also pioneered a technology called SIIT-DC, which has undergone peer review in the IETF and will likely be published as an RFC any day now. SIIT-DC allows us to operate our data centre applications using exclusively IPv6, while at the same time ensuring that they remain available from the IPv4 Internet without any performance or functionality loss.

A quick introduction to SIIT-DC

SIIT-DC works by embedding the 32-bit IPv4 source address of the client into an IPv6 address located within a 96-bit translation prefix (96 + 32 = 128, the number of bits in an IPv6 address). It is easiest to explain with an example:

Assume an IPv4-only client with the address 198.51.100.42 makes an HTTP request to a web site hosted in an IPv6-only data centre. The client’s initial IPv4 packet will be routed to the nearest SIIT-DC Border Relay, which will translate the packet to IPv6. If we assume that the translation prefix in use is 64:ff9b::/96, the resulting IPv6 packet will have a source address of 64:ff9b::c633:642a. (An alternative way of representing this address is 64:ff9b::198.51.100.42, by the way.)
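If you’d like to verify the address mapping yourself, an ordinary shell printf does the trick - each IPv4 octet simply becomes two hex digits at the tail of the IPv6 address:

$ printf '64:ff9b::%02x%02x:%02x%02x\n' 198 51 100 42
64:ff9b::c633:642a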

The translated IPv6 packet then gets routed through the IPv6 data centre network until it reaches the web site’s Varnish Cache. Varnish responds to it as it would with any other native IPv6 packet. The response gets routed to the nearest SIIT-DC Border Relay, where it gets translated back to IPv4 and finally routed back to the IPv4-only client. There is full bi-directional connectivity between the IPv4-only client and the IPv6-only server, allowing the HTTP request to complete successfully.

That’s the gist of it, anyway. If you’d like to learn more about SIIT-DC, you should start out by watching this presentation about it, held at the RIPE69 conference in London last November.

What’s the problem, then?

From Varnish’s point of view, the translated IPv4 client looks the same as a native IPv6 one. SIIT-DC hides the fact that the client is in reality using IPv4. The implication is that the VCL variable client.ip will contain the IPv6 address 64:ff9b::c633:642a, instead of the IPv4 address 198.51.100.42.

If you don’t use the client.ip variable for anything, then there’s no problem at all. If, on the other hand, you do use client.ip for something, and that something expects to work on literal IPv4 addresses, then there’s a problem. For example, an IP geolocation library is unlikely to return anything useful when given an IPv6 address such as 64:ff9b::c633:642a to locate.

The solution: libvmod-rfc6052

Even though our example 64:ff9b::c633:642a looks nothing like an IPv4 address, it’s important to realise that the original IPv4 address is still there - it’s just hidden in the last 32 bits of the IPv6 address, i.e., in the hexadecimal number 0xc633642a.

So all we need to do is to extract those 32 bits and transform them back to a regular IPv4 address. Doing just that is exactly the purpose of libvmod-rfc6052. It is a new Varnish Module that extends VCL with a set of functions that:

  • Check whether a Varnish sockaddr data structure (VSA) (e.g., client.ip) contains a so-called IPv4-embedded IPv6 address (cf. RFC6052 section 2.2).
  • Extract the embedded IPv4 address from an IPv6 VSA, returning a new IPv4 VSA containing the embedded IPv4 address.
  • Perform an in-place substitution of an IPv6 VSA containing an IPv4-embedded IPv6 address with a new IPv4 VSA containing the embedded IPv4 address.

The following example VCL code shows how these functions can be used to insert an X-Forwarded-For HTTP header into the request. The use of libvmod-rfc6052 ensures that the backend server will only ever see native IPv4 and IPv6 addresses.

import rfc6052;

sub vcl_init {
    # Set a custom translation prefix (/96 is implied).
    # Default: 64:ff9b::/96 (see RFC6052 section 2.1).
    rfc6052.prefix("2001:db8:46::");
}

sub vcl_recv {
    ###
    ### Alternative A: use rfc6052.extract().
    ### This leaves the "client.ip" variable intact.
    ###

    if(rfc6052.is_v4embedded(client.ip)) {
        # "client.ip" contains an RFC6052 IPv4-embedded IPv6
        # address. Set XFF to the embedded IPv4 address:
        set req.http.X-Forwarded-For = rfc6052.extract(client.ip);
    } else {
        # "client.ip" contained an IPv4 address, or a native
        # (non-RFC6052) IPv6 address. No RFC6052 extraction
        # necessary, we can just set XFF directly:
        set req.http.X-Forwarded-For = client.ip;
    }

    ##############################################################

    ###
    ### Alternative B: Use replace() to change the
    ### value of "client.ip" before setting XFF.
    ###

    rfc6052.replace(client.ip);

    # If "client.ip" originally contained an IPv4-embedded
    # IPv6 address, it will now contain just the IPv4 address.
    # Otherwise, replace() did no changes, and "client.ip"
    # still contains its original value. In any case, we can
    # now be certain that "client.ip" no longer contains an
    # IPv4-embedded IPv6 address.

    set req.http.X-Forwarded-For = client.ip;
}

We’re naturally providing libvmod-rfc6052 as free and open-source software in the hope that it will be useful to the Varnish and IPv6 communities.

If you try it out, don’t hesitate to share your experiences with us. Should you stumble across any bugs or have any suggestions, head over to the libvmod-rfc6052 GitHub repo and submit an issue.

Making a Homenet router out of OpenWrt

This post provides a step-by-step guide on how to take a residential gateway running OpenWrt, install the software from the Hnet project on it, and finally convert it into a full-fledged Homenet router. The post is quite long, but don’t let that put you off - it’s only because I go through the process in minute detail. The entire conversion process shouldn’t take you more than 10-15 minutes.

Unlike the Hnet project’s own setup instructions, I’ll use only the OpenWrt web interface LuCI, and provide plenty of screenshots. That way I’m hoping to help make Hnet and Homenet a little bit more accessible to users who might not be too comfortable working with OpenWrt’s command line interface.

If you don’t know what Homenet is or why you would want your residential gateway to be a Homenet router in the first place, I suggest you go read my previous post, where I introduced Homenet.

Prerequisites

First of all, you’ll need a residential gateway with OpenWrt 15.05 Chaos Calmer installed. Consult OpenWrt’s Table of Hardware to check if your device is supported. The device page linked from the table should contain installation instructions. The router I’ll be using in this walk-through is a Netgear WNDR3700v2.

Second, I recommend that you use a laptop with both wired and wireless connectivity. That way, you can safely reconfigure the router’s wired ports while connected with wireless and vice versa, thus greatly reducing the risk of inadvertently locking yourself out of your device. While it’s certainly possible to make do without, the following instructions will assume that you’re using such a laptop.

If you end up locking yourself out of your device anyway, look for a physical button labelled Reset or something similar. You can usually make the router revert to its default configuration by keeping it pressed for 10-15 seconds. If that doesn’t work out for you, consult the documentation on OpenWrt’s failsafe mode.

The home network: bridged or routed?

Like most residential gateways, OpenWrt comes by default with a single logical LAN interface which is just a layer-2 bridge consisting of all the wired and wireless interfaces in the router, excluding the WAN interface. That is however not how a Homenet router is meant to operate. In Homenet, layer-3 is king: each interface has its own isolated network segment, complete with its own IP prefixes. Standard layer-3 routing is used whenever hosts on different segments need to communicate.

In this guide, I’ll set it up the proper Homenet way. That said, a Homenet router also supports traditional bridged LAN segments. This approach can be used if you want to keep full layer-2 connectivity between all the hosts in your home network.

Step 1: Install the Hnet software suite

OpenWrt 15.05 Chaos Calmer doesn’t come with the Hnet software installed by default, so our first order of business is to install it from the Internet. Connect your router’s WAN port to an Internet-connected network (such as directly to your ISP or your pre-existing home LAN), and connect your laptop to one of the router’s LAN ports using wired Ethernet. You should now be able to access LuCI, OpenWrt’s web interface, at http://openwrt.lan:

Log in as root without a password. It will take you to LuCI’s Status/Overview page:

LuCI will insist that you set a password. This can be done on the System/Administration page.

If you want, you can now visit Network/Interfaces to verify that the router’s WAN interface has been automatically configured:

Head to System/Software and click the Update lists button to refresh the list of software available for download:

When that has completed, download and install the ipset package and then hnet-full. (I’m not 100% certain that ipset is strictly necessary, but you’ll get a warning when installing hnet-full if ipset isn’t already installed.)
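For reference, the command-line equivalent of this step would just be a couple of opkg commands over SSH - though as promised, everything in this guide can be done from LuCI alone:

opkg update
opkg install ipset hnet-full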

After the installation of hnet-full, all the software necessary for Homenet operation is installed. It is now necessary to reboot the router in order for the software to become fully operational (this is probably a bug). You can do so from the System/Reboot page.

Step 2: Disable the non-Homenet ULA prefix handling

After the reboot, head to Network/Interfaces. Near the bottom of the page there’s a text field labelled IPv6 ULA-Prefix that contains an auto-generated prefix. Empty this text field and then click Save & Apply:

Why is this necessary? Hnet generates and maintains its own ULA prefixes independently of the IPv6 ULA-Prefix setting. However, due to a bug, Homenet interfaces created in LuCI will end up with two ULA prefixes assigned; the native Homenet-maintained one in addition to the non-Homenet one specified in IPv6 ULA-Prefix. Removing the non-Homenet setting successfully works around this bug.

Step 3: Convert the WAN interface to Homenet

Stay on the Network/Interfaces page and make a note of which physical port the default WAN and WAN6 interfaces are using (eth1 in my case), then click their Delete buttons to remove them. You should now be left only with the default non-Homenet LAN interface:

Once they’re gone, click Add new interface…. You can give it any name you want, except for LAN, WAN, or WAN6. Hnet will automatically detect the role of an interface as long as it does not have any of those special names. (I’m calling mine e0, short for Ethernet port 0.) Choose the protocol Automatic Homenet (HNCP), set it to cover the same physical interface as the old WAN/WAN6 interface did, and finally click Submit and then Save & Apply.

If everything went well, you should be returned to the interface list, and after a few seconds your new Homenet interface should show as having acquired connectivity from the upstream network:

Step 4: Convert the wireless interfaces to Homenet

We’ll first need to remove the wireless interfaces from the default non-Homenet LAN bridge. Hit Edit on the row with the LAN interface, and go to the Physical Settings tab. Remove the tick in the checkbox next to any wireless interface you see, and click Save & Apply.

Next, head to Network/Wifi to see the list of wireless interfaces:

Click the Edit button for one of the wireless interfaces, tick the checkbox next to create: and give the new interface a name. You can use any name you want, except for LAN/WAN/WAN6 as discussed above. You might also want to take some time to explore the various tabs here in order to configure security and encryption, wireless band and channel, country, and so on.

If your device has multiple wireless interfaces, I strongly suggest that you also give them different ESSIDs. This is because most wireless clients will assume that all access points using the same ESSID connect to the same layer-2 segment. That’s not the case in Homenet, so if a client roams from one AP to another, it might experience connectivity issues. Using differing ESSIDs will prevent this from occurring.

When you’re happy with the setup, click Save & Apply. Repeat the process for all the wireless interfaces in the list. The final step is to click each interface’s Enable button in the interface list to turn on the radio:

Head back to Network/Interfaces. You should see the new wireless interfaces in the list:

Click Edit for one of them. On the next page, set the protocol to Automatic Homenet (HNCP) and click Switch protocol and then Save & Apply.

Repeat the process for any other wireless interfaces in the list. When you’re done, the interfaces should all have been configured with IP addresses:

At this point, disconnect your laptop’s Ethernet cable and connect to one of the ESSIDs you just created. If it works, congratulations! Your laptop is now connected to a Homenet-handled network segment. From now on you’ll need to access LuCI at http://openwrt.home (note that the domain suffix has changed).

Step 5: Create per-port VLANs in the embedded switch

(If you’re going for a traditional bridged layer-2 home network, you can skip this section.)

The external LAN ports on my router are connected to an embedded Ethernet switch, which in turn has a single interface connected to the “CPU” where OpenWrt runs. The following figure from the OpenWrt Wiki illustrates the architecture:

I’ll use VLANs to make each of the four external LAN ports their own Homenet interface. This is done on the Network/Switch page. My WNDR3700v2’s default configuration contains only a single VLAN:

My new configuration consists of four VLANs, one for each external LAN port. Each of the VLANs is set up as untagged for its associated external LAN port, tagged for the CPU port, and off for all other ports. I’ve opted to give each VLAN the same ID as the number of its associated external LAN port, but this is just a matter of preference - at the end of the day, it doesn’t matter which values the VLAN IDs are set to. When you’re happy, click Save & Apply.

Step 6: Create Homenet interfaces for the LAN ports

Return to Network/Interfaces. Delete the old LAN interface the same way you did with WAN and WAN6. Now you should only be left with the new Homenet interfaces you’ve created so far:

What now remains to be done is to create Homenet interfaces for each of the VLANs in the switch. This is done in the same way I created the e0 interface earlier; first, click Add new interface…. In the next view, give it a name of your liking (except LAN/WAN/WAN6), choose the Automatic Homenet (HNCP) protocol, set it to cover one of the VLAN interfaces you just created, and click Submit and then Save & Apply.

Repeat this procedure for each of the VLANs configured in the embedded switch. When you’re done, the list on the Network/Interfaces page should look something like this:

Mission accomplished!

Congratulations! Your router is now a pure Homenet router. Head to Status/Homenet to see a dynamically updated graph of your Homenet topology:

Of course, with only a single Homenet router this graph isn’t extremely interesting, but at least it should show your router, its interfaces, and any IP prefixes it has been assigned. You can click on various nodes in the graph to get more details in JSON format.

If you followed my advice on interface naming, your router’s interfaces should no longer have pre-determined roles such as WAN or LAN. You may, for example, connect your upstream Internet connection to the port labelled LAN 3 and a regular host to the port labelled WAN - it will work just as well as the other way around. This ability alone will give Homenet an unprecedented level of «plug&play-ness» compared to the regular residential gateways on sale today.

If you own several residential gateways supported by OpenWrt, try converting them all to Homenet routers and connect them to each other in arbitrary ways - including via wireless. They’ll automatically discover each other and form a coherent network topology, which the Status/Homenet graph will reflect in seconds. I’ve tried this and it Just Works. Good-bye, IPv4 NAT stacking and DHCPv6-PD cascading! You shall not be missed. That said, hosts or non-Homenet routers connecting to the Homenet will be granted a DHCPv6 Prefix Delegation if they ask for one.

It is also possible to connect your Homenet to multiple ISPs, and it should all Just Work, even if the ISPs are connected to different Homenet routers. Well, in theory anyway - I haven’t yet tested Homenet with multiple ISPs myself. If you do, please let me know how it worked out. You can reach me and the Hnet team itself in #hnet-hackers on freenode.

Cisco UCS, multi-queue NICs, and RSS

The other day one of my colleagues at Redpill Linpro asked me to help him figure out why a web cache server started responding slowly during a traffic peak. My colleague was scratching his head over the problem, because although the traffic level was unusually high for the server in question, it was nowhere close to saturating the server’s 10 Gb/s of available network bandwidth:

The server was running the Varnish Cache, a truly excellent piece of software which, when running on modern hardware, will easily serve 10 Gb/s of web traffic without breaking a sweat. The CPU graph confirmed that lack of processing capacity had not been an issue; the server in question, a Cisco UCS B200 M3, had been mostly idling during the problematic period:

In spite of the above, another graph gave a significant clue as to what was going on - the network interface had been dropping quite a few inbound packets:

That certainly explained the slowness - dropped packets lead to TCP timeouts and subsequent retransmissions, which will be rather damaging to interactive and latency-sensitive application protocols such as HTTP. My colleague had correctly identified what had happened - the remaining question was why?

Diagnosing the root cause of the dropped packets

Checking the output from the diagnostic commands ip -s -s link show dev eth5 and ethtool -S eth5 on the server in question revealed that every single one of the dropped packets was missed due to rx_no_bufs. In other words, inbound packets had been arriving faster than the server had been able to process them.
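In case you’d like to run the same checks on your own servers, the commands were along these lines (filtering the ethtool statistics down to non-zero counters keeps the output manageable):

$ ip -s -s link show dev eth5
$ ethtool -S eth5 | grep -v ': 0$'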

Taking a closer look at the CPU graph revealed a subtle hint: the softirq field had exceeded 100%. While it is not possible to tell with certainty from the aggregated graph, this could mean that a single one of the server’s 40 CPU cores had been completely busy processing software interrupts - which happens to be where incoming network packets are processed. (If you’re interested in learning more about Linux’s software interrupt mechanism, take a look at this LWN article.)
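If you want to look for the same pattern yourself, the per-CPU softirq activity can be inspected with standard tools (mpstat is part of the sysstat package):

$ watch -n1 cat /proc/softirqs   # per-CPU softirq counters, NET_RX in particular
$ mpstat -P ALL 1                # per-CPU utilisation, including %soft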

I then checked how many interrupts the network adapter had:

tore@ucstest:~$ awk '/eth5/ {print $NF}' /proc/interrupts 
eth5-rx-0
eth5-tx-0
eth5-err
eth5-notify

Only a single receive queue! In other words, the server’s network adapter did not appear to be a multi-queue NIC. This in turn meant that every incoming packet during the problematic period would have been processed by a single CPU core. This CPU core would in all likelihood have been completely overloaded, while all the other 39 CPU cores were just sitting there with almost nothing to do.

Enabling multiple queues and Receive-side Scaling

It fortunately turned out that the network adapter in question, a Cisco UCS VIC 1240, is a multi-queue NIC - but this functionality is for some unfathomable reason disabled by the default ethernet adapter policy:

ucs1-osl3-B# scope org
ucs1-osl3-B /org # enter eth-policy default
ucs1-osl3-B /org/eth-policy # show expand 

Eth Adapter Policy:
    Name: default

    ARFS:
        Accelarated Receive Flow Steering: Disabled

    Ethernet Completion Queue:
        Count: 2

    Ethernet Failback:
        Timeout (sec): 5

    Ethernet Interrupt:
        Coalescing Time (us): 125
        Coalescing Type: Min
        Count: 4
        Driver Interrupt Mode: MSI-X

    NVGRE:
        NVGRE: Disabled

    Ethernet Offload:
        Large Receive: Enabled
        TCP Segment: Enabled
        TCP Rx Checksum: Enabled
        TCP Tx Checksum: Enabled

    Ethernet Receive Queue:
        Count: 1   <------ only 1 receive queue configured!
        Ring Size: 512

    VXLAN:
        VXLAN: Disabled

    Ethernet Transmit Queue:
        Count: 1   <------ only 1 transmit queue configured!
        Ring Size: 256

    RSS:
        Receive Side Scaling: Disabled

These settings can also be seen (and changed) in the UCS Manager GUI, under Servers -> Policies -> Adapter Policies:

Fortunately, it was possible to improve matters by simply changing the ethernet adapter policy. Hardware in a Cisco UCS environment can take on different personalities based on software configuration, and the number of queues in a network adapter is no exception. The commands below show how to increase both the number of receive and transmit queues:

ucs1-osl3-B# scope org
ucs1-osl3-B /org # enter eth-policy default
ucs1-osl3-B /org/eth-policy # set recv-queue count 8
ucs1-osl3-B /org/eth-policy* # set trans-queue count 8

However, in order to actually make use of the multiple receive queues, it is also necessary to enable Receive-side scaling (RSS). RSS is what ensures that the network adapter will uniformly distribute incoming packets across its multiple receive queues, which in turn are routed to separate CPU cores. In addition, it is necessary to configure the number of completion queues to the sum of configured receive and transmit queues, and the number of interrupts to the number of completion queues plus 2:

ucs1-osl3-B /org/eth-policy* # set rss receivesidescaling enabled 
ucs1-osl3-B /org/eth-policy* # set comp-queue count 16
ucs1-osl3-B /org/eth-policy* # set interrupt count 18

One might stop at this point to wonder why one has to explicitly enable RSS when recv-queue count is configured to more than 1, and similarly why the values for comp-queue count and interrupt count must be explicitly set instead of being automatically calculated. I have no idea. It is what it is.

Finally, I also noticed that Accelerated Receive Flow Steering (ARFS) is supported, but not enabled by default. Reading about it (and RFS in general), it seems to me that ARFS is also something that you really want by default if you care about performance. Thus:

ucs1-osl3-B /org/eth-policy # set arfs accelaratedrfs enabled 

(Yes, accelaratedrfs is the spelling expected by the UCS CLI.)

Activating the changes at this point is only a matter of issuing the standard commit-buffer command. That said, do be aware that a reboot will be required to activate these changes, which in turn means that any service profile that’s using this ethernet adapter policy and has a maintenance policy set to immediate will instantly reboot.
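For completeness, committing looks like this - note how the asterisk marking uncommitted changes disappears from the prompt:

ucs1-osl3-B /org/eth-policy* # commit-buffer
ucs1-osl3-B /org/eth-policy #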

After the reboot, we can see that the ethernet adapter now has the requested number of queues and interrupts available:

tore@ucstest:~$ awk '/eth5/ {print $NF}' /proc/interrupts 
eth5-rx-0
eth5-rx-1
eth5-rx-2
eth5-rx-3
eth5-rx-4
eth5-rx-5
eth5-rx-6
eth5-rx-7
eth5-tx-0
eth5-tx-1
eth5-tx-2
eth5-tx-3
eth5-tx-4
eth5-tx-5
eth5-tx-6
eth5-tx-7
eth5-err
eth5-notify

Problem solved! The server is now much better prepared to deal with the next traffic peak, as inbound traffic will now be distributed across eight CPU cores instead of just one. I expect that the server’s 10 Gb/s of available network bandwidth will be saturated with outbound traffic long before the rate of incoming packets would become a bottleneck.

Note that it’s also important to ensure that the irqbalance daemon is running. Without it, all eight eth5-rx-* interrupts could potentially end up being routed to the same CPU core anyway, which would mean we’ve gained absolutely nothing. Fortunately, irqbalance is enabled by default on most Linux distributions.
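If you want to verify the spread, or pin interrupts manually on a system without irqbalance, the kernel exposes everything you need under /proc (the IRQ number 120 below is just a made-up example):

$ grep eth5-rx /proc/interrupts                  # which CPUs service each queue
$ echo 4 | sudo tee /proc/irq/120/smp_affinity   # pin IRQ 120 to CPU 2 (mask 0x4)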

Regarding hardware limitations

You might wonder why I enabled only eight queues for each direction, given that the blade in question has 40 CPU cores. Well, I did try to enable more, and while it is indeed possible to configure up to a maximum of 256 transmit and receive queues in a UCS ethernet adapter policy, checking /proc/interrupts after rebooting will reveal that only 8+8 were created anyway. I assume that this is a hardware limitation. I also tested this with an older B200 M2 blade with an M81KR network adapter, and the limitation was exactly the same - only eight queues per direction were created.

I have to say that a maximum of eight receive queues is far from impressive, as other common 10 Gb network adapters support many more. The Intel 82599 supports 128 receive/transmit queues, for example. That said, having eight receive queues can make a world of difference compared to having just the default single one.

I also found out that it is not safe to configure the maximum possible 256 transmit and receive queues in the ethernet adapter policy. One might assume that doing so would cause the system to simply adjust the effective number down to the maximum supported by hardware. However, that approach works only for service profiles with a single vNIC; the service profile fails to associate if it contains two or more vNICs with such a policy. Looking at the FSM status while attempting this, the B200 M2 with the M81KR adapter gets stuck with an error message of Out of CQ resources, while the B200 M3 with the VIC 1240 would get Adapter configDataNicCb(): vnicEthCreate failed. Attempting to reboot them in this state didn’t work either, they just got stuck - the M2 blade entered the EFI shell, while the M3 entered the BIOS Setup utility.

Thus my conclusion is that the optimal number of receive and transmit queues to configure in the default ethernet adapter policy is 8+8 for any server containing the M81KR or VIC 1240 adapter. For other adapter models, attempting a boot with 256+256 queues and a single vNIC is probably a good way to determine the actual hardware limitations (and, by extension, the optimal default values for that particular adapter model).

In any case, discovering the default UCS behaviour was kind of like coming home after having bought a new sports car with a V8 engine, only to discover that the manufacturer had only bothered to install a spark plug in one out of its eight cylinders. It is truly a terrible default! If someone at Cisco ever reads this, I’d strongly suggest that the default behaviour simply be to enable the maximum number of queues supported by the hardware in question. That’s the only way to unleash the full performance of the hardware, and it is certainly a prerequisite in order for a web server workload to come anywhere near fully utilising the 10 Gb/s of available network bandwidth.