SuperComputing 2010
Moving Towards Terabit/Sec Transfers



Detailed report for UMichigan and MSU

 


Shawn McKee (smckee@umich.edu)

Two systems at Michigan State University (MSU) and four systems at the University of Michigan (UM) were prepared for SC10 (two of them were shared with other SC10 demos).

System Specs:

i) MSU: 2x Dell R710, 48 GB RAM, Intel X520-DA NIC (2x SFP+ 10GE ports), 2x E5620 (Westmere, 2.4 GHz), 3x H800 RAID controllers, 6x MD1200 disk shelves (12x 2TB NL-SAS 7.2k RPM each, RAID-6; 2 shelves per H800 controller, redundantly cabled)
ii) UM: 2x Dell R610, 48 GB RAM, Intel X520-DA NIC (UMFS01: 2x SFP+ 10GE ports) or Broadcom BCM57711 (UMFS03: 2x SFP+ 10GE), 2x E5620 (Westmere, 2.4 GHz), 1x H800 RAID controller, 5x MD1200 disk shelves (12x 2TB NL-SAS 7.2k RPM each, RAID-0; 5 shelves per H800 controller, non-redundantly cabled)
iii) UM: 2x Dell R710, 24 GB RAM, Intel NetEffect NE020 (iWARP RNIC) 10GE CX4, 2x X5650 (Westmere, 2.66 GHz), 1x H700 RAID controller, 6x 2TB NL-SAS 7.2k RPM disks in RAID-0 in the chassis.

For the WAN tests we wrote to MSUFS12/MSUFS13 from sc9-sc12 and achieved 9 Gbps. Only 2 shelves (out of 6) were used on each of MSUFS12 and MSUFS13 (1 per SC10 node). We then started transfers back from MSUFS12 and MSUFS13 to nodes at SC10; this decreased the "writing" rate to 7.5 Gbps, while the reading rate was 3.5 Gbps (writing on sc11 and sc12 on the show floor). The decrease in the writing rate occurred because sc11 and sc12 were not able to both read and write without interfering with each other. I believe that if we had had a storage server equivalent to MSUFS12/MSUFS13 at SC10, we could easily have filled a 10GE WAN pipe with just one such system on each end.

Using FDT we were able to read from 2 (out of 6) shelves on MSUFS12 and write to 2 (out of 6) shelves on MSUFS13 at 7 Gbps (see the attached msufs12_13_fdt_2.png). We then started reading from 2 shelves on MSUFS13 and writing to 2 shelves on MSUFS12, also at 7 Gbps, while the prior transfers were still running.
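For reference, a transfer of this kind is driven with an FDT server on the receiving host and an FDT client on the sending host. The sketch below shows the general form only; the port, stream count, hostnames, and directory paths are illustrative placeholders rather than the exact values we used:

    # On the receiving host (e.g. msufs13), start the FDT server:
    java -jar fdt.jar -p 54321

    # On the sending host (e.g. msufs12), push files from one shelf array
    # into a directory on another, using several parallel TCP streams:
    java -jar fdt.jar -c msufs13 -p 54321 -P 4 -d /raid6-2/incoming \
         /raid6-1/testdata/file1 /raid6-1/testdata/file2

Running a second client in the opposite direction at the same time gives the simultaneous read/write pattern described above.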

All systems were benchmarked with IOZONE using one thread per "disk" (shelf array): the MSU nodes were tested with 6 threads, the UM R610s with 5, and the UM R710s with 1.
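For anyone repeating the benchmark, an IOZONE throughput run of this style (sequential write and read, one thread per shelf array) looks roughly like the following; the file size, record size, and mount points are placeholders, not our exact parameters:

    # Sequential write (-i 0) and read (-i 1) in throughput mode,
    # 6 threads, one test file on each of the 6 shelf arrays:
    iozone -i 0 -i 1 -t 6 -s 64g -r 1m \
           -F /raid1/iozone.tmp /raid2/iozone.tmp /raid3/iozone.tmp \
              /raid4/iozone.tmp /raid5/iozone.tmp /raid6/iozone.tmp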

The following graphic is from the IOZONE test on MSUFS13 (green is "write", blue is "read"). We were able to achieve about 4.2 GBytes/sec writing (6 shelves in parallel) and 5.1 GBytes/sec reading (6 shelves in parallel) on its RAID-6 arrays.


For the UM R610s we got 2.35 GBytes/sec writing and 2.95 GBytes/sec reading (total) on their RAID-0 arrays.

For the UM R710s (6 disks in the server chassis) we got 0.37 GBytes/sec writing and 0.34 GBytes/sec reading on their RAID-0 arrays.

In memory-to-memory tests we were able to fill the pipe (9.95 Gbps) in one direction using MSUFS12. We were also able to run both directions at 9 Gbps each way at the same time.
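These memory-to-memory checks are standard iperf-style tests; a rough sketch follows, where the window size, stream count, hostname, and duration are illustrative assumptions rather than our exact settings:

    # On MSUFS12 (server side):
    iperf -s -w 4M

    # On the SC10 node (client side): 4 parallel TCP streams for 60 seconds,
    # reporting every 5 seconds; -d runs both directions at once
    # (alternatively, start a second client in the reverse direction):
    iperf -c msufs12 -w 4M -P 4 -t 60 -i 5 -d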

There were numerous link-quality and minor routing issues at SC10 that were quickly resolved. Debugging the WAN links was harder, but at least feasible because we had test nodes at intermediate locations to help isolate the problems. One important lesson relearned was that iperf has an order dependency for some of its arguments: when running UDP tests, the bandwidth and packet-size arguments (-b, -l) should come *after* the -c <server_ip> option.
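To make that ordering concrete, a UDP test in the form that behaves as expected looks like the following; the server IP, target rate, and datagram size shown here are just placeholder examples:

    # Server side:
    iperf -s -u

    # Client side: -b (target bandwidth) and -l (datagram size)
    # come AFTER the -c <server_ip> option, per the ordering issue above:
    iperf -u -c 192.0.2.10 -b 8000M -l 8972 -t 30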

Some of the issues we saw actually came from AGLT2 production traffic interacting with the SC10 data flows. One very useful outcome was the ability to show that our MiLR connections for the Tier-2 are running well and can pass 8 Gbps of UDP traffic without loss (when there isn't real congestion present).

Another issue for AGLT2 was making sure we had the layer-2 paths working correctly. We use per-VLAN spanning tree on Nile (Cisco) but Rapid Spanning Tree (RSTP) on the Dells. We did see some issues with spanning tree getting confused while we tried to engineer non-conflicting paths that use the redundant links. This is an area where we need to do more work (having resilient links with multiple spanning trees so that more of the available bandwidth can be used).

The FDT interface is very useful to have and really makes doing SC demos feasible. One of the main issues is getting the FDT arguments set up well (knowing the possible sources/destinations and the related flags needed to make sure data flows the way you want it to). Perhaps in the future there could be some additional preloading of disk-to-disk parameters "found" by LISA: pick two hosts to exchange data and the interface sets up the transfer parameters between them, which you can then edit.