
Cisco UCS, multi-queue NICs, and RSS

The other day one of my colleagues at Redpill Linpro asked me to help him figure out why a web cache server started responding slowly during a traffic peak. My colleague was scratching his head over the problem, because although the traffic level was unusually high for the server in question, it was nowhere close to saturating the server’s 10 Gb/s of available network bandwidth:

The server was running the Varnish Cache, a truly excellent piece of software which, when running on modern hardware, will easily serve 10 Gb/s of web traffic without breaking a sweat. The CPU graph confirmed that lack of processing capacity had not been an issue; the server in question, a Cisco UCS B200 M3, had been mostly idling during the problematic period:

In spite of the above, another graph gave a significant clue as to what was going on - the network interface had been dropping quite a few inbound packets:

That certainly explained the slowness - dropped packets lead to TCP timeouts and subsequent retransmissions, which will be rather damaging to interactive and latency-sensitive application protocols such as HTTP. My colleague had correctly identified what had happened - the remaining question was why?

Diagnosing the root cause of the dropped packets

Checking the output from the diagnostic commands ip -s -s link show dev eth5 and ethtool -S eth5 on the server in question revealed that every single one of the dropped packets had been missed due to rx_no_bufs. In other words, inbound packets had been arriving faster than the server had been able to process them.
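
Roughly, the checks looked like this (the exact counter names reported by ethtool -S vary between NIC drivers; rx_no_bufs is what this particular adapter's driver reports):

tore@ucstest:~$ ip -s -s link show dev eth5        # kernel-level interface statistics, including RX drops
tore@ucstest:~$ ethtool -S eth5 | grep -i no_buf   # driver-level counters, e.g. rx_no_bufs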

Taking a closer look at the CPU graph revealed a subtle hint: the softirq field had exceeded 100%. While it is not possible to tell with certainty from the aggregated graph, this could mean that a single one of the server’s 40 CPU cores had been completely busy processing software interrupts - which happens to be where incoming network packets are processed. (If you’re interested in learning more about Linux’s software interrupt mechanism, take a look at this LWN article.)
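
A quick way to check whether a single core is doing all the softirq work is to look at the per-CPU NET_RX counters, or at the per-CPU %soft column if the sysstat package is installed - something along these lines:

tore@ucstest:~$ watch -d 'cat /proc/softirqs'   # the NET_RX row shows per-CPU softirq counts
tore@ucstest:~$ mpstat -P ALL 1                 # per-CPU utilisation, including %soft (requires sysstat)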

I then checked how many interrupts the network adapter had:

tore@ucstest:~$ awk '/eth5/ {print $NF}' /proc/interrupts 
eth5-rx-0
eth5-tx-0
eth5-err
eth5-notify

Only a single receive queue! In other words, the server’s network adapter did not appear to be a multi-queue NIC. This in turn meant that every incoming packet during the problematic period would have been processed by a single CPU core. This CPU core would in all likelihood have been completely overloaded, while the other 39 CPU cores were just sitting there with almost nothing to do.

Enabling multiple queues and Receive-side Scaling

It fortunately turned out that the network adapter in question, a Cisco UCS VIC 1240, is a multi-queue NIC - but this functionality is for some unfathomable reason disabled by the default ethernet adapter policy:

ucs1-osl3-B# scope org
ucs1-osl3-B /org # enter eth-policy default
ucs1-osl3-B /org/eth-policy # show expand 

Eth Adapter Policy:
    Name: default

    ARFS:
        Accelarated Receive Flow Steering: Disabled

    Ethernet Completion Queue:
        Count: 2

    Ethernet Failback:
        Timeout (sec): 5

    Ethernet Interrupt:
        Coalescing Time (us): 125
        Coalescing Type: Min
        Count: 4
        Driver Interrupt Mode: MSI-X

    NVGRE:
        NVGRE: Disabled

    Ethernet Offload:
        Large Receive: Enabled
        TCP Segment: Enabled
        TCP Rx Checksum: Enabled
        TCP Tx Checksum: Enabled

    Ethernet Receive Queue:
        Count: 1   <------ only 1 receive queue configured!
        Ring Size: 512

    VXLAN:
        VXLAN: Disabled

    Ethernet Transmit Queue:
        Count: 1   <------ only 1 transmit queue configured!
        Ring Size: 256

    RSS:
        Receive Side Scaling: Disabled

These settings can also be seen (and changed) in the UCS Manager GUI, under Servers -> Policies -> Adapter Policies:

Fortunately, it was possible to improve matters by simply changing the ethernet adapter policy. Hardware in a Cisco UCS environment can take on different personalities based on software configuration, and the number of queues in a network adapter is no exception. The commands below show how to increase both the number of receive and transmit queues:

ucs1-osl3-B# scope org
ucs1-osl3-B /org # enter eth-policy default
ucs1-osl3-B /org/eth-policy # set recv-queue count 8
ucs1-osl3-B /org/eth-policy* # set trans-queue count 8

However, in order to actually make use of the multiple receive queues, it is also necessary to enable Receive-side scaling (RSS). RSS is what ensures that the network adapter will uniformly distribute incoming packets across its multiple receive queues, which in turn are routed to separate CPU cores. In addition, it is necessary to configure the number of completion queues to the sum of configured receive and transmit queues, and the number of interrupts to the number of completion queues plus 2:

ucs1-osl3-B /org/eth-policy* # set rss receivesidescaling enabled 
ucs1-osl3-B /org/eth-policy* # set comp-queue count 16
ucs1-osl3-B /org/eth-policy* # set interrupt count 18

One might stop at this point to wonder why one has to explicitly enable RSS when recv-queue count is configured to more than 1, and similarly why the values for comp-queue count and interrupt count must be explicitly set instead of being automatically calculated. I have no idea. It is what it is.

Finally, I also noticed that Accelerated Receive Flow Steering (ARFS) is supported, but not enabled by default. Having read about it (and RFS in general), it seems to me that ARFS is also something that you really want enabled by default if you care about performance. Thus:

ucs1-osl3-B /org/eth-policy # set arfs accelaratedrfs enabled 

(Yes, accelaratedrfs is the spelling expected by the UCS CLI.)

Activating the changes at this point is only a matter of issuing the standard commit-buffer command. That said, do be aware that a reboot will be required to activate these changes, which in turn means that any service profile that’s using this ethernet adapter policy and has a maintenance policy set to immediate will instantly reboot.
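
For completeness, the commit looks like this in the CLI, issued from the same eth-policy scope as above (the asterisk in the prompt disappears once the pending changes have been committed):

ucs1-osl3-B /org/eth-policy* # commit-buffer
ucs1-osl3-B /org/eth-policy #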

After the reboot, we can see that the ethernet adapter now has the requested number of queues and interrupts available:

tore@ucstest:~$ awk '/eth5/ {print $NF}' /proc/interrupts 
eth5-rx-0
eth5-rx-1
eth5-rx-2
eth5-rx-3
eth5-rx-4
eth5-rx-5
eth5-rx-6
eth5-rx-7
eth5-tx-0
eth5-tx-1
eth5-tx-2
eth5-tx-3
eth5-tx-4
eth5-tx-5
eth5-tx-6
eth5-tx-7
eth5-err
eth5-notify
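
Driver support permitting, the queue and RSS setup can also be cross-checked from the Linux side with ethtool:

tore@ucstest:~$ ethtool -l eth5   # the number of RX/TX channels the driver has created
tore@ucstest:~$ ethtool -x eth5   # the RSS indirection table mapping flow hashes to RX queues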

Problem solved! The server is now much better prepared to deal with the next traffic peak, as inbound traffic will now be distributed across eight CPU cores instead of just one. I expect that the server’s 10 Gb/s of available network bandwidth will be saturated with outbound traffic long before the rate of incoming packets would become a bottleneck.

Note that it’s also important to ensure that the irqbalance daemon is running. Without it, all eight eth5-rx-* interrupts could potentially end up being routed to the same CPU core anyway, which would mean we’ve gained absolutely nothing. Fortunately, irqbalance is enabled by default on most Linux distributions.
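
How to verify this depends on the distribution and init system; the sketch below uses a hypothetical IRQ number, so substitute the ones /proc/interrupts reports for your eth5-rx-* queues:

tore@ucstest:~$ service irqbalance status           # or: systemctl status irqbalance
tore@ucstest:~$ grep eth5-rx /proc/interrupts       # per-CPU interrupt counts for each receive queue
tore@ucstest:~$ cat /proc/irq/98/smp_affinity_list  # which CPUs IRQ 98 (an example) may be delivered to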

Regarding hardware limitations

You might wonder why I enabled only eight queues for each direction, given that the blade in question has 40 CPU cores. Well, I did try to enable more, and while it is indeed possible to configure up to a maximum of 256 transmit and receive queues in a UCS ethernet adapter policy, checking /proc/interrupts after rebooting will reveal that only 8+8 were created anyway. I assume that this is a hardware limitation. I also tested this with an older B200 M2 blade with an M81KR network adapter, and the limitation was exactly the same - only eight queues per direction were created.

I have to say that a maximum of eight receive queues is far from impressive, as other common 10 Gb network adapters support many more. The Intel 82599 supports 128 receive/transmit queues, for example. That said, having eight receive queues can make a world of difference compared to having just the default single one.

I also found out that it is not safe to configure the maximum possible 256 transmit and receive queues in the ethernet adapter policy. One might assume that doing so would cause the system to simply adjust the effective number down to the maximum supported by the hardware. However, that approach works only for service profiles with a single vNIC; the service profile fails to associate if it contains two or more vNICs with such a policy. Looking at the FSM status while attempting this, the B200 M2 with the M81KR adapter got stuck with an error message of Out of CQ resources, while the B200 M3 with the VIC 1240 got Adapter configDataNicCb(): vnicEthCreate failed. Attempting to reboot them in this state didn’t work either; they just got stuck - the M2 blade entered the EFI shell, while the M3 entered the BIOS Setup utility.

Thus my conclusion is that the optimal number of receive and transmit queues to configure in the default ethernet adapter policy is 8+8 for any server containing the M81KR or VIC 1240 adapter. For other adapter models, attempting a boot with 256+256 queues and a single vNIC is probably a good way to determine the actual hardware limitations (and, by extension, the optimal default values for that particular adapter model).

In any case, discovering the default UCS behaviour was kind of like coming home after having bought a new sports car with a V8 engine, only to discover that the manufacturer had only bothered to install a spark plug in one out of its eight cylinders. It is truly a terrible default! If someone at Cisco ever reads this, I’d strongly suggest that the default behaviour simply be to enable the maximum number of queues supported by the hardware in question. That’s the only way to unleash the full performance of the hardware, and it is certainly a prerequisite for a web server workload to come anywhere near fully utilising the 10 Gb/s of available network bandwidth.

Homenet - the future of home networking

Today’s residential home networks are quite simple. They usually have a single Internet connection, which is plugged into the WAN port of a residential gateway. The gateway will typically feature a few wired Ethernet ports and a wireless access point, which are in most cases bridged together to form a single layer-2 LAN segment. The LAN segment is configured with private IPv4 addresses; in order to let the hosts and devices on the LAN segment communicate with the IPv4 Internet, the gateway performs IPv4 NAT.

If the requirements of the home network are equally simple, this will work well enough for most users. However, the moment you start adding more functionality, things get complicated quickly. For example, how does one go about introducing another Internet connection? Or another residential gateway? Or IPv6? Or all of the above? Someone competent in computer networking might well be able to set it up, but Joe Public will probably have to resort to trial and error. He might end up with a completely non-functional network, or he might get lucky and stumble across a configuration that ostensibly works - but even in this case, it is very likely that the home network would suffer a loss of functionality and/or performance in the process.

Introducing the IETF Homenet working group

Fortunately, the fact that residential home networks were rapidly falling behind the technology curve wasn’t lost on the IETF. In 2011, the working group Home Networking - colloquially known as Homenet - was founded to create a set of standards that would allow even the most non-technical user to fully unleash the potential of his home network in a self-configuring «plug and play» manner. Quoting from the working group charter:

This working group focuses on the evolving networking technology within and among relatively small “residential home” networks. For example, an obvious trend in home networking is the proliferation of networking technology in an increasingly broad range and number of devices. […]

Home networks need to provide the tools to handle these situations in a manner accessible to all users of home networks. Manual configuration is rarely, if at all, possible, as the necessary skills and in some cases even suitable management interfaces are missing.

The purpose of this working group is to focus on this evolution, in particular as it addresses the introduction of IPv6, by developing an architecture addressing this full scope of requirements:

o prefix configuration for routers
o managing routing
o name resolution
o service discovery
o network security

The decision to base the new standards on an IPv6 foundation was likely an easy one to make. Not only is IPv6 the only future-proof option, it also comes with certain features that facilitate automatic and self-configuring networks (such as ubiquitous link-local addresses and SLAAC). That said, Homenet won’t deprive anyone of their IPv4 connectivity - the working group isn’t blind to the reality that IPv4 will remain a necessity for most users in the years to come:

The group should assume that an IPv4 network may have to co-exist alongside the IPv6 network and should take this into account insofar as alignment with IPv6 is desirable.

So far, the Homenet working group has published one RFC titled IPv6 Home Networking Architecture Principles, and is currently actively working on a number of other Internet-Drafts that describe the nitty-gritty details of how it all fits together.

Running code: the Hnet project

While it’s clearly important to have the Homenet standard properly documented in RFCs, these documents aren’t particularly useful on their own. At the end of the day, it’s the availability of functioning implementations of those RFCs, i.e., running code, that truly matters.

In spite of being a work in progress, the current draft specifications have proven mature enough for the Hnet project to build a working open-source Homenet implementation. The Hnet project is included in the latest stable OpenWrt release, version 15.05 Chaos Calmer.

That means that if you own a residential gateway device that’s amongst the several hundred models supported by OpenWrt (or are willing to spend something like €20-€30 on one), you can already today take Homenet for a spin and experience the future of home networking. In the past few weeks I’ve been doing just that, and I can say that I am pleasantly surprised at how well it actually works. It’s even fully integrated in OpenWrt’s web interface - no command line familiarity required. I’ve converted my own home network to be based exclusively on OpenWrt and Hnet, and I see no reason to go back to my old legacy setup.

In an upcoming post I will explain how to take a default installation of OpenWrt 15.05 Chaos Calmer, install the software from the Hnet project, and configure it to be a Homenet router. Stay tuned!

IPv6 mobile roaming: possible or not?

Since 2012 I have been voting with my wallet, opting to only use mobile providers that give me IPv6 connectivity. To begin with, I was a customer of Network Norway. They were kind enough to include me in their IPv6-only pilot with DNS64 and NAT64. Unfortunately, Network Norway’s fate was to be acquired multiple times, causing their IPv6 pilot to lose momentum due to the loss of key technical personnel. The IPv6 pilot has not yet transitioned to a production service.

In 2014, I changed to Telenor Norway. Telenor provides two APNs that support IPv6:

  • telenor.smart, Telenor’s default APN. It supports the IP, IPV6, and IPV4V6 PDP context types. telenor.smart uses CGN for IPv4 Internet access, and does not provide DNS64/NAT64 service.
  • telenor.mobil, which supports only the IPV6 PDP context type, as well as DNS64/NAT64.

IPv6 is a fully supported production service in Telenor’s mobile network, meaning that any Telenor subscriber can configure his or her device to use IPv6 with one of these APNs, if it isn’t already doing so by default.

During the years I’ve had IPv6 service from my mobile providers, I have travelled a lot. I have been to much of Europe, as well as to several other countries including Canada, Japan, and the USA. Wherever I went, roaming with IPv6 has been one of those things that Just Works. Therefore I was greatly surprised to hear that, during the APNIC 40 conference, Telstra Australia’s Sunny Yeung stated that «until every carrier has activated IPv6 there is no way to activate IPv6 for international roaming». Sunny couldn’t possibly be right, could he? Had I just imagined that IPv6 roaming worked for me?

An upcoming business trip to Sweden soon gave me the opportunity to double-check whether or not IPv6 roaming truly works. To that end, I used Jason Fesler’s excellent test-ipv6.com site to prove beyond any doubt that IPv6 roaming does work, as demonstrated below:

The two screenshots above are from my Jolla Phone when connected to the telenor.smart APN using a dual-stack IPV4V6 PDP context. As the second screenshot shows, the phone has been provisioned with a globally unique IPv6 address as well as a private IPv4 address. (A keen observer might note that there is no indication that I am actually roaming in either screenshot. You’ll just have to take my word for it; I simply don’t know which command is used to show the roaming status in Jolla’s Sailfish OS.)

This screenshot shows roaming using the built-in cellular modem in my laptop. This modem is not exactly state of the art, so it does not support the dual-stack IPV4V6 PDP context type. Therefore, I instead use a single-stack IPV6 PDP context towards the telenor.mobil APN. You can also see that I am using clatd to set up a 464XLAT CLAT interface. This provides any legacy IPv4-only applications running on my laptop with seemingly native IPv4 connectivity they can use to communicate with the IPv4 Internet.

Sunny, I hope you consider this post very good news. After all, it might mean that deploying IPv6 in Telstra’s mobile network isn’t the impossibility that your APNIC 40 statement suggests you currently think it is. While it’s evident that you haven’t found the way to do it yet, the efforts of your colleagues in Telenor Norway demonstrate that there clearly is a way. I know several of the folks involved in Telenor’s IPv6 efforts, and they are a friendly bunch - do not hesitate to contact me if you want me to introduce you to them! I’m certain that they would gladly help you find your way towards deploying IPv6 in Telstra’s mobile network.

Working around GitHub Pages and Fastly's missing IPv6 support using Apache mod_proxy

The problem

As I noted in my previous post, GitHub Pages (just like GitHub itself) does not support IPv6. This is because GitHub Pages’ CDN provider, Fastly, doesn’t support it:

$ host toreanderson.github.io
toreanderson.github.io is an alias for github.map.fastly.net.
github.map.fastly.net has address 185.31.17.133

I find this very disappointing. Fastly was founded in 2011, so they are a rather new CDN. Their platform is built on top of Varnish, which has, to the best of my knowledge, supported IPv6 since before Fastly was founded, so it’s not like their lack of IPv6 can be explained by legacy, difficult-to-upgrade IPv4-only internal infrastructure. I find it rather ironic that a provider calling themselves Fastly fails to support the protocol that was recently reported by Facebook to yield 30-40% faster time-to-last-byte web page load times than IPv4. So in case anyone from Fastly is reading this, I suggest you either a) start supporting IPv6 ASAP, or failing that, b) rename your CDN platform to something more appropriate, like Slowly.

The workaround

My employer is kind enough to provide me with a virtual machine I can use for personal purposes. This server runs Linux, Ubuntu Trusty LTS to be specific. It is of course available over both IPv6 as well as IPv4 (using SIIT-DC), so the idea here is to use it to provide a dual-stacked façade, thus concealing the fact that the GitHub Pages service doesn’t support IPv6.

As this server was already hosting my sorry excuse for a home page, it already had the Apache web server software installed. Apache comes with mod_proxy, which is perfectly suited for what I want to do.

The first order of business is to ensure the mod_proxy module is loaded. On Ubuntu, this is most easily done using the a2enmod utility:

# a2enmod proxy_http
Considering dependency proxy for proxy_http:
Enabling module proxy.
Enabling module proxy_http.
To activate the new configuration, you need to run:
  service apache2 restart
# service apache2 restart
 * Restarting web server apache2                              [ OK ]

The next order of business is to create a VirtualHost definition that uses mod_proxy to forward all incoming HTTP requests to GitHub Pages. I did this by creating a new file /etc/apache2/sites-enabled/http_blog.fud.no.conf with the contents below, before reloading the configuration with the command apache2ctl graceful.

<VirtualHost *:80>
	ServerName blog.toreanderson.no
	ServerAlias blog.fud.no
	ProxyPass "/" "http://toreanderson.github.io/"
	ProxyPassReverse "/" "http://toreanderson.github.io/"
</VirtualHost>

The ProxyPass directive makes incoming HTTP requests from clients be forwarded to http://toreanderson.github.io/. ProxyPassReverse ensures that any HTTP headers containing the string http://toreanderson.github.io/ in the server response from GitHub Pages will be changed back to http://blog.toreanderson.no/ (or http://blog.fud.no/). I’m not exactly sure if ProxyPassReverse is really needed for GitHub Pages, but it doesn’t hurt to have it in the configuration anyway.

The final order of business is to ensure that the two hostnames mentioned in the ServerName and ServerAlias directives exist in DNS and point to the server. I did this by simply adding IN CNAME records that point to an already existing hostname with IPv4 IN A and IPv6 IN AAAA records:

$ host -t CNAME blog.fud.no.
blog.fud.no is an alias for fud.no.
$ host -t CNAME blog.toreanderson.no.
blog.toreanderson.no is an alias for fud.no.
$ host -t A fud.no.
fud.no has address 87.238.60.0
$ host -t AAAA fud.no.
fud.no has IPv6 address 2a02:c0:1001:100::145
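
With the DNS records in place, a quick way to confirm that the proxy responds over both address families is to force curl to use each protocol in turn:

$ curl -4 -sI http://blog.toreanderson.no/ | head -n1   # request over IPv4
$ curl -6 -sI http://blog.toreanderson.no/ | head -n1   # request over IPv6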

Another thing worth mentioning here: By using my own domain names, I am also making sure that my blog’s URL is secured using DNSSEC, another important piece of Internet technology that GitHub Pages and Fastly currently neglect to support.

Summary

With the help of Apache mod_proxy, http://blog.toreanderson.no is now available over both IPv4 and IPv6. I am therefore now comfortable with letting people know that this blog actually exists. While this workaround is far from ideal from a technical point of view, it is better than the alternative - having to wait an indeterminate amount of time for Fastly to get around to dual-stacking their CDN.

First post

So I’ve finally created my own blog, and you’ve found it, somehow. Congratulations! Expect only posts about technology - networking, data centres, open source software, reports from conferences I attend, et cetera. Essentially, various stuff I play around with both at home and at my workplace Redpill Linpro.

I guess a sensible thing to discuss in the first post is my choice of blogging platform. There were several criteria on my wish list:

  • I want my content to stay mine, and be trivially portable to another platform (including self-hosting) if I so choose. I am therefore rather sceptical of «blog in the cloud» solutions such as Blogger and WordPress.
  • I wanted to get started really quickly without spending any time doing web design, software installations, database setups, and so on. Installing and maintaining my own instance of WordPress or some other CMS - no thanks.
  • I prefer to author posts in a simple (yet sufficiently powerful) markup language that is well suited to technical content (automatic syntax highlighting of quoted code, for example). The markup language should of course be editable with my favourite editor too, so no binary formats!
  • I would very much like to use a simple Git repo as the underlying database where all the content is stored.
  • I want my content to be available over IPv6. I expect to be writing quite a lot about IPv6, so not having the blog available over IPv6 would feel rather embarrassing.

So what I ended up with was GitHub Pages, which automatically renders the files stored in the backend Git repo through Jekyll to create a simple blog. In addition to that, I used Tinypress to bootstrap the initial contents of the backend Git repo. Tinypress also provides a web interface where I can create or edit posts, which I think will prove convenient from time to time.

So far so good. The one thing that’s missing is IPv6 support. The GitHub Pages service, or more precisely its CDN provider Fastly, does not appear to support IPv6 yet:

$ host toreanderson.github.io
toreanderson.github.io is an alias for github.map.fastly.net.
github.map.fastly.net has address 185.31.17.133

Given that it’s 2015, that’s rather disappointing, but I’m guessing I’ll find a way to work around it for now. Most likely I’ll set up some IPv6 frontend system at work that can convert IPv6 traffic to IPv4 and pass it along to the Fastly back-end. The details of that will probably be the subject of my second post. Stay tuned.