SuperComputing 2011
Moving Towards Terabit/Sec Transfers

40GE FAST Data Transfer Kit - Server / Network Tuning

1. Hardware Selection 2. Test Setup 3. Optimization 4. Results

Changes in Server BIOS

  • Hyper-Threading (HT) should be disabled, as it adds extra latency to processes.
  • We recommend disabling the power-saving mode, i.e. the processor C1E states; the processor then of course runs at full frequency all the time.
  • Some motherboards, e.g. SuperMicro, require manually changing the PCIe slot from Gen2 to Gen3, which is specifically required for the Mellanox NIC.
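
After a reboot, the effect of these BIOS settings can be sanity-checked from Linux. A minimal sketch, assuming the Mellanox NIC shows up at PCI address 83:00.0 (the address will differ per system):

# lscpu | grep "Thread(s) per core"     # reports 1 when Hyper-Threading is disabled
# lspci | grep -i mellanox              # find the PCI address of the NIC
# lspci -s 83:00.0 -vv | grep LnkSta    # "Speed 8GT/s" indicates a Gen3 link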

Changing TCP/IP Kernel Parameters

The Mellanox installation updates /etc/sysctl.conf with kernel parameters that are sufficient for 40GE flows; a single flow in a LAN environment can reach up to 18 Gbps.

## MLXNET tuning parameters ##
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_low_latency = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
## END MLXNET ##
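
The values above can be loaded and spot-checked without a reboot. A minimal sketch:

# sysctl -p                      # re-read /etc/sysctl.conf and apply the values
# sysctl net.core.rmem_max       # verify a single parameter; should print 16777216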

Mellanox Driver

By default, the Mellanox driver looks at the number of cores and assigns RX and TX queues accordingly, up to a maximum of 16 of each. If required, these queues can be limited to a smaller number, but it should be a power of 2.

# cat /etc/modprobe.d/mlx4_en.conf
options mlx4_en num_rx_rings=8
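
The option only takes effect after the driver is reloaded, which takes the interface down. A minimal sketch of reloading and verifying the queue count (assuming the interface is eth2, as elsewhere in this guide):

# modprobe -r mlx4_en              # unload the driver; the link goes down
# modprobe mlx4_en                 # reload it with the options from mlx4_en.conf
# ls /sys/class/net/eth2/queues/   # lists one rx-N / tx-N entry per queue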

Linux Interface Configuration

  • MTU should be set to 9000 (see the sketch after this list for applying it at runtime).
  • According to discussions with Mellanox, changing txqueuelen from the default value of 1000 to 10000 is no longer required since the introduction of a driver supporting multiple TX queues.
  • For longer RTTs it is recommended to increase the default Mellanox ring buffer from 1024 to 8192 using # ethtool -G eth2 rx 8192 tx 8192
    • Important: changing the ring buffer values will close the Ethernet port and open it again.
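
A minimal sketch of applying and verifying these interface settings at runtime (assuming eth2; the MTU should also be made persistent in the distribution's interface configuration file):

# ip link set dev eth2 mtu 9000          # enable jumbo frames on the interface
# ethtool -G eth2 rx 8192 tx 8192        # enlarge the ring buffers; the port flaps
# ethtool -g eth2                        # confirm the new ring sizes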

Kernel SMP Affinity

Mellanox provides a very handy tool for setting the correct SMP affinity for the Ethernet interface. As discussed in the setup, we just need to know which processor the NIC is connected to. In our case the Mellanox NIC is installed in slot 5, which is connected to processor 1, thus:

# irq_set_affinity_by_node 1 eth2

will set proper affinity for eth2
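
The resulting affinity can be verified by hand; a sketch assuming eth2 (the IRQ numbers differ per system, so replace <IRQ> with a number from the first column of the grep output):

# grep eth2 /proc/interrupts             # list the IRQs assigned to the interface
# cat /proc/irq/<IRQ>/smp_affinity_list  # cores serving that IRQ; repeat per IRQ

Each listed core should belong to the processor the NIC is attached to (processor 1 in our setup).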

Processor NUMA management for processes

Information about the NUMA nodes and node distances on your system:

# numactl --hardware

Allocate memory on the same NUMA node where the process runs (in this example, processor cores 1 and 2 on NUMA node 0):

# numactl --physcpubind=1,2 --localalloc /usr/java/latest/bin/java -jar /root/fdt.jar &

Bind a process only to NUMA node 0:

# numactl --cpunodebind=0 /usr/java/latest/bin/java -jar /root/fdt.jar &
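
To double-check that the binding matches the hardware, the NUMA node of the NIC itself and the per-node allocation counters can be inspected (assuming eth2; numa_node reads -1 if the platform does not export the information):

# cat /sys/class/net/eth2/device/numa_node   # NUMA node the NIC is attached to
# numastat                                   # numa_hit / numa_miss counters per node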

 

Network Switch Tuning

Switch ports connecting to the Mellanox NIC must have flow control enabled.

  • Z9000 (config t)# flowcontrol tx on rx on
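
Flow control must also be enabled on the server side of the link. The pause-frame settings of the Mellanox port can be checked and, if needed, switched on with ethtool (assuming eth2):

# ethtool -a eth2                 # show the current RX/TX pause settings
# ethtool -A eth2 rx on tx on     # enable pause frames if they are off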

Jumbo frames should be enabled.

  • Z9000 (config t)# mtu 9216
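
Whether jumbo frames actually pass end to end can be tested from the server with a non-fragmenting ping; 8972 bytes of ICMP payload plus 28 bytes of ICMP/IP headers gives a 9000-byte packet (the destination 192.168.1.2 is only a placeholder):

# ping -M do -s 8972 192.168.1.2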

A useful setting for seeing interface statistics updated every 30 seconds while analyzing traffic on a switch port is:

  • Z9000 (config t)# rate-interval 30
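
On the server side, a comparable rolling view of the interface counters can be obtained with watch (assuming eth2):

# watch -n 30 'ip -s link show dev eth2'    # refresh the RX/TX counters every 30 seconds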