Cisco UCS, multi-queue NICs, and RSS
The other day one of my colleagues at Redpill Linpro asked me to help him figure out why a web cache server started responding slowly during a traffic peak. My colleague was scratching his head over the problem, because although the traffic level was unusually high for the server in question, it was nowhere close to saturating the server’s 10 Gb/s of available network bandwidth:
The server was running the Varnish Cache, a truly excellent piece of software which, when running on modern hardware, will easily serve 10 Gb/s of web traffic without breaking a sweat. The CPU graph confirmed that lack of processing capacity had not been an issue; the server in question, a Cisco UCS B200 M3, had been mostly idling during the problematic period:
In spite of the above, another graph gave a significant clue as to what was going on - the network interface had been dropping quite a few inbound packets:
That certainly explained the slowness - dropped packets lead to TCP timeouts and subsequent retransmissions, which are rather damaging to interactive and latency-sensitive application protocols such as HTTP. My colleague had correctly identified what had happened - the remaining question was why?
Diagnosing the root cause of the dropped packets
Checking the output from the diagnostic commands ip -s -s link show dev eth5 and ethtool -S eth5 on the server in question revealed that every single one of the dropped packets had been missed due to rx_no_bufs. In other words, inbound packets had been arriving faster than the server had been able to process them.
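For reference, this is roughly what those checks look like - the counter value below is made up for illustration, and rx_no_bufs is the counter name used by the Cisco enic driver:
tore@ucstest:~$ ip -s -s link show dev eth5        # look at the "dropped" column under RX
tore@ucstest:~$ ethtool -S eth5 | grep rx_no_bufs
     rx_no_bufs: 123456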
Taking a closer look at the CPU graph revealed a subtle hint: the softirq field had exceeded 100%. While it is not possible to tell with certainty from the aggregated graph, this could mean that a single one of the server's 40 CPU cores had been completely busy processing software interrupts - which happens to be where incoming network packets are processed. (If you're interested in learning more about Linux's software interrupt mechanism, take a look at this LWN article.)
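If you want to check this yourself, the per-CPU softirq load can be inspected directly - the %soft column in mpstat (from the sysstat package) or the raw NET_RX counters in /proc/softirqs will quickly show whether a single core is doing all the work:
tore@ucstest:~$ mpstat -P ALL 1 1                       # watch the %soft column per CPU
tore@ucstest:~$ grep -E 'CPU|NET_RX' /proc/softirqs     # raw per-CPU NET_RX softirq counts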
I then checked how many interrupts the network adapter had:
tore@ucstest:~$ awk '/eth5/ {print $NF}' /proc/interrupts
eth5-rx-0
eth5-tx-0
eth5-err
eth5-notify
Only a single receive queue! In other words, the server's network adapter did not appear to be a multi-queue NIC. This in turn meant that every incoming packet during the problematic period would have been processed by a single CPU core. This CPU core would in all likelihood have been completely overloaded, while the other 39 CPU cores were just sitting there with almost nothing to do.
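To confirm where that load ends up, you can look up which CPU core the single receive queue's interrupt is routed to (the IRQ number and core shown below are just examples):
tore@ucstest:~$ grep eth5-rx-0 /proc/interrupts | cut -d: -f1
  74
tore@ucstest:~$ cat /proc/irq/74/smp_affinity_list
3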
Enabling multiple queues and Receive-side Scaling
It fortunately turned out that the network adapter in question, a Cisco UCS VIC 1240, is a multi-queue NIC - but this functionality is for some unfathomable reason disabled by the default ethernet adapter policy:
ucs1-osl3-B# scope org
ucs1-osl3-B /org # enter eth-policy default
ucs1-osl3-B /org/eth-policy # show expand
Eth Adapter Policy:
Name: default
ARFS:
Accelarated Receive Flow Steering: Disabled
Ethernet Completion Queue:
Count: 2
Ethernet Failback:
Timeout (sec): 5
Ethernet Interrupt:
Coalescing Time (us): 125
Coalescing Type: Min
Count: 4
Driver Interrupt Mode: MSI-X
NVGRE:
NVGRE: Disabled
Ethernet Offload:
Large Receive: Enabled
TCP Segment: Enabled
TCP Rx Checksum: Enabled
TCP Tx Checksum: Enabled
Ethernet Receive Queue:
Count: 1 <------ only 1 receive queue configured!
Ring Size: 512
VXLAN:
VXLAN: Disabled
Ethernet Transmit Queue:
Count: 1 <------ only 1 transmit queue configured!
Ring Size: 256
RSS:
Receive Side Scaling: Disabled
These settings can also be seen (and changed) in the UCS Manager GUI, under Servers -> Policies -> Adapter Policies:
Fortunately, it was possible to improve matters by simply changing the ethernet adapter policy. Hardware in a Cisco UCS environment can take on different personalities based on software configuration, and the number of queues in a network adapter is no exception. The commands below show how to increase both the number of receive and transmit queues:
ucs1-osl3-B# scope org
ucs1-osl3-B /org # enter eth-policy default
ucs1-osl3-B /org/eth-policy # set recv-queue count 8
ucs1-osl3-B /org/eth-policy* # set trans-queue count 8
However, in order to actually make use of the multiple receive queues, it is also necessary to enable Receive-side scaling (RSS). RSS is what ensures that the network adapter will uniformly distribute incoming packets across its multiple receive queues, which in turn are routed to separate CPU cores. In addition, it is necessary to configure the number of completion queues to the sum of configured receive and transmit queues, and the number of interrupts to the number of completion queues plus 2:
ucs1-osl3-B /org/eth-policy* # set rss receivesidescaling enabled
ucs1-osl3-B /org/eth-policy* # set comp-queue count 16
ucs1-osl3-B /org/eth-policy* # set interrupt count 18
One might stop at this point to wonder why one has to explicitly enable RSS when recv-queue count is configured to more than 1, and similarly why the values for comp-queue count and interrupt count must be explicitly set instead of being automatically calculated. I have no idea. It is what it is.
Finally, I also noticed that Accelerated Receive Flow Steering (ARFS) is supported, but not enabled by default. After reading about it (and RFS in general), it seems to me that ARFS is also something you really want by default if you care about performance. Thus:
ucs1-osl3-B /org/eth-policy # set arfs accelaratedrfs enabled
(Yes, accelaratedrfs is the spelling expected by the UCS CLI.)
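Note that flipping this knob in the adapter policy is, as far as I can tell, only half the job: on the Linux side, accelerated RFS generally only kicks in once plain RFS has been configured, and many drivers additionally require ntuple filtering to be switched on. A rough sketch of what that could look like (the table sizes are just an example):
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries    # global flow table, shared by all NICs
for q in /sys/class/net/eth5/queues/rx-*; do
    echo 4096 > $q/rps_flow_cnt                          # 32768 / 8 receive queues
done
ethtool -K eth5 ntuple on                                # required by many drivers for ARFS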
Activating the changes at this point is only a matter of issuing the standard commit-buffer command. That said, do be aware that a reboot will be required to activate these changes, which in turn means that any service profile that's using this ethernet adapter policy and has a maintenance policy set to immediate will instantly reboot.
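For the record, the commit itself is simply:
ucs1-osl3-B /org/eth-policy* # commit-buffer
ucs1-osl3-B /org/eth-policy #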
After the reboot, we can see that the ethernet adapter now has the requested number of queues and interrupts available:
tore@ucstest:~$ awk '/eth5/ {print $NF}' /proc/interrupts
eth5-rx-0
eth5-rx-1
eth5-rx-2
eth5-rx-3
eth5-rx-4
eth5-rx-5
eth5-rx-6
eth5-rx-7
eth5-tx-0
eth5-tx-1
eth5-tx-2
eth5-tx-3
eth5-tx-4
eth5-tx-5
eth5-tx-6
eth5-tx-7
eth5-err
eth5-notify
Problem solved! The server is now much better prepared to deal with the next traffic peak, as inbound traffic will now be distributed across eight CPU cores instead of just one. I expect that the server's 10 Gb/s of available network bandwidth will be saturated with outbound traffic long before the rate of incoming packets becomes a bottleneck.
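A quick sanity check once some traffic is flowing is to watch the interrupt counters and confirm that all eight receive queues actually see packets:
tore@ucstest:~$ watch -d 'grep eth5-rx /proc/interrupts'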
Note that it’s also important to ensure that the
irqbalance
daemon is running.
Without it, all eight eth5-rx-*
interrupts could potentially end up being
routed to the same CPU core anyway, which would mean we’ve gained absolutely
nothing. Fortunately, irqbalance
is enabled by default on most Linux
distributions.
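Verifying this is straightforward: check that irqbalance is active (the systemctl command assumes a systemd-based distribution) and that the receive queue interrupts have ended up on different cores:
tore@ucstest:~$ systemctl is-active irqbalance
active
tore@ucstest:~$ for irq in $(awk -F: '/eth5-rx/ {print $1}' /proc/interrupts); do
>   cat /proc/irq/$irq/smp_affinity_list
> done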
Regarding hardware limitations
You might wonder why I enabled only eight queues for each direction, given that the blade in question has 40 CPU cores. Well, I did try to enable more, and while it is indeed possible to configure up to a maximum of 256 transmit and receive queues in a UCS ethernet adapter policy, checking /proc/interrupts after rebooting will reveal that only 8+8 were created anyway. I assume that this is a hardware limitation. I also tested this with an older B200 M2 blade with an M81KR network adapter, and the limitation was exactly the same - only eight queues per direction were created.
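If the driver supports it (I have not verified that enic does), ethtool can also report the channel limits directly, which saves a reboot cycle of trial and error:
tore@ucstest:~$ ethtool -l eth5        # shows pre-set maximum and current RX/TX queue counts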
I have to say that a maximum of eight receive queues is far from impressive, as other common 10 Gb network adapters support many more. The Intel 82599 supports 128 receive/transmit queues, for example. That said, having eight receive queues can make a world of difference compared to having just the default single one.
I also found out that it is not safe to configure the maximum possible 256 transmit and receive queues in the ethernet adapter policy. One might assume that doing so would cause the system to simply adjust the effective number down to the maximum supported by hardware. However, that approach works only for service profiles with a single vNIC; the service profile fails to associate if it contains two or more vNICs with such a policy. Looking at the FSM status while attempting this, the B200 M2 with the M81KR adapter gets stuck with an error message of Out of CQ resources, while the B200 M3 with the VIC 1240 would get Adapter configDataNicCb(): vnicEthCreate failed. Attempting to reboot them in this state didn't work either, they just got stuck - the M2 blade entered the EFI shell, while the M3 entered the BIOS Setup utility.
Thus my conclusion is that the optimal number of receive and transmit queues to configure in the default ethernet adapter policy is 8+8 for any server containing the M81KR or VIC 1240 adapter. For other adapter models, attempting a boot with 256+256 queues and a single vNIC is probably a good way to determine the actual hardware limitations (and, by extension, the optimal default values for that particular adapter model).
In any case, discovering the default UCS behaviour was kind of like coming home after having bought a new sports car with a V8 engine, only to discover that the manufacturer had only bothered to install a spark plug in one of its eight cylinders. It is truly a terrible default! If someone at Cisco ever reads this, I'd strongly suggest that the default behaviour should simply be to enable the maximum number of queues supported by the hardware in question. That's the only way to unleash the full performance of the hardware, and it is certainly a prerequisite for a web server workload to come anywhere near fully utilising the 10 Gb/s of available network bandwidth.