Hardware Setup
Below is a short summary of the hardware setup used by the team during the SC14 demonstration.
Multiple storage servers equipped with high-speed HBAs and SSDs were used. These servers were spread across three booths: Caltech, LAC/iCAIR and Vanderbilt.
Each server had 36 SSDs, split between 3 disk controllers (12 SSDs per controller). The capacities of the SSDs varied from storage server to storage server, between 128 GB and 256 GB. On the network side, each server was equipped with two 40GE NICs.
The storage servers ran on CentOS 7.
Below was the final network mapping among various pairs of storage servers:
Node 1 | Node 2 |
Caltech 1 | iCAIR 1 |
Caltech 2 | iCAIR 2 |
Caltech 3 | Vanderbilt 1 |
Caltech 4 | Vanderbilt 2 |
Caltech 5 | iCAIR 3 |
Caltech 6 | Vanderbilt 3 |
Caltech 7 | Vanderbilt 5 |
Caltech 10 | iCAIR 4 |
Caltech 12 | iCAIR 7 |
Caltech 13 | iCAIR 6 |
Software setup
1) PhEDEx installation
A normal PhEDEx installation consists of:
– a central database (usually at CERN)
– a data service/website (usually located at CERN)
– a set of central agents (usually running on the T0)
– a set of site agents running for a given PhEDEx site
– various datasets located on the different PhEDEx sites
– file transfers between sites, usually performed using FTS
For this demonstration, we adapted the standard installation to better suit our needs:
– one PhEDEx site = one NIC in a storage server.
This resulted in 2 PhEDEx sites per storage server (e.g. T2_Caltech_1_1 & T2_Caltech_1_2).
Doing so allowed us to control which of the interfaces we transferred to and from on a given storage server.
– one set of site agents handled work for all the nodes in one booth.
This meant we just had to manage 3 sets of site agents (instead of 40 – one for each site)
– each set of site agents ran in its own virtual machine.
– every dataset had to be specially constructed in order to use all controllers equally
We had to ensure that when a new transfer (batch) started, it contained files spread equally over all three disk controllers. This was done to ensure the highest possible read performance.
– the TFC [1] was adapted to distribute files equally on the destination side.
In order to get the highest possible write speed, we had to guarantee that a file leaving controller 1 on the source side would be written to controller 1 on the destination side. When given a list of files, FDT automatically starts multiple readers/writers based on the number of source/destination disks.
– FDT was used as a transfer tool
Since PhEDEx can’t use it directly, we relied on a 3rd-party wrapper called fdtcp. This wrapper had to be installed on each of the VMs and on all the storage servers. It was also the source of some issues, which we present later on.
– the servers used were designed for extremely high speed, not for large storage capacity
At line rate (40 Gbps) a 20 GB file takes around 4 seconds to transfer. If we wanted longer transfers we needed a lot of files, but unfortunately we were limited in storage space. Because of this, we created a special script which filled 40% of the free space with real files (containing random data) and then created as many symlinks to those real files as were needed.
That script also created an XML file containing the description of the dataset to be used in PhEDEx.
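For illustration, a minimal sketch of such a fill script is shown below. The mount point, file names, the file size and the XML layout are assumptions made for this example; it is not the actual script used at SC14.

```python
#!/usr/bin/env python
# Illustrative sketch (not the SC14 script): fill ~40% of the free space on a
# mount point with files of random data, then add symlinks to those files
# until a nominal dataset size is reached, and dump a simple XML description.
import os
import uuid
import xml.etree.ElementTree as ET

MOUNT      = "/storage/controller1"   # hypothetical mount point
FILE_SIZE  = 20 * 1024**3             # 20 GB per file
DATASET_SZ = 100 * 1000**4            # nominal 100 TB dataset (links included)

def fill_real_files(mount, fraction=0.4):
    """Write random-data files until `fraction` of the free space is used."""
    stat = os.statvfs(mount)
    budget = int(stat.f_bavail * stat.f_frsize * fraction)
    real_files = []
    while budget >= FILE_SIZE:
        path = os.path.join(mount, "real_%s.dat" % uuid.uuid4().hex)
        with open(path, "wb") as fh:
            left = FILE_SIZE
            while left > 0:
                chunk = min(left, 64 * 1024**2)
                fh.write(os.urandom(chunk))   # random payload, not compressible
                left -= chunk
        real_files.append(path)
        budget -= FILE_SIZE
    return real_files

def add_symlinks(real_files, mount, dataset_size):
    """Create symlinks to the real files until the nominal size is reached."""
    entries = list(real_files)
    i = 0
    while len(entries) * FILE_SIZE < dataset_size:
        link = os.path.join(mount, "link_%06d.dat" % i)
        os.symlink(real_files[i % len(real_files)], link)
        entries.append(link)
        i += 1
    return entries

def write_xml(entries, out="dataset.xml"):
    """Write an illustrative XML listing (not the exact PhEDEx injection schema)."""
    root = ET.Element("dataset", name="/SC14/Demo/RAW", filesize=str(FILE_SIZE))
    for path in entries:
        ET.SubElement(root, "file", lfn=os.path.basename(path), pfn=path)
    ET.ElementTree(root).write(out)

if __name__ == "__main__":
    files = fill_real_files(MOUNT)
    write_xml(add_symlinks(files, MOUNT, DATASET_SZ))
```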
Here are some of the decisions taken with regard to file/block/dataset sizes.
Item | Size | Remarks |
File | 20 GB | |
Block | 20 TB | 1000 files/block |
FileRouter window | 100 TB | ~ 5x block size |
Dataset | 100 TB | 5 blocks |
Batch (files in one FDT transfer) | 5 TB | 250 files |
Item | Transfer duration[2] |
File | 4 seconds |
Cron job | 2 minutes |
Disk full | 4 to 8 minutes |
Batch | 16min40s |
Block | 66 minutes |
Dataset | 5h30min |
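The durations in this table follow directly from the sizes chosen above at the assumed 40 Gbps (5 GB/s) line rate [2]; the short check below reproduces them.

```python
# Sanity check of the durations above at the assumed 5 GB/s line rate.
RATE = 5  # GB per second (40 Gbps)

sizes_gb = {
    "File":    20,       # 20 GB
    "Batch":   5000,     # 5 TB  = 250 files
    "Block":   20000,    # 20 TB = 1000 files
    "Dataset": 100000,   # 100 TB = 5 blocks
}

for item, size in sizes_gb.items():
    seconds = size / RATE
    print("%-8s %7.0f s  (~%.1f min)" % (item, seconds, seconds / 60))
# File ~4 s, Batch ~16.7 min, Block ~66.7 min, Dataset ~5.6 h
```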
2) FDTCP
As previously mentioned, FDTCP (a wrapper for FDT written in Python) was installed on all storage servers as well as on the PhEDEx servers. It consists of several components: an executable (fdtcp), a daemon running on the host (fdtd) and the FDT tool itself, which it uses for transfers.
Ex: fdtcp fdt://serverIP1:port/pathToFile fdt://serverIP2:port/pathToFile
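The snippet below only illustrates this command form; the host names, port number and path are placeholders, and it is not how PhEDEx itself drives the tool.

```python
# Illustration of the fdtcp command form shown above (fdt://host:port/path).
# Hosts, port and path are placeholders; PhEDEx invokes the tool differently.
import subprocess

def fdtcp_copy(src_host, dst_host, path, port=8444):
    src = "fdt://%s:%d%s" % (src_host, port, path)
    dst = "fdt://%s:%d%s" % (dst_host, port, path)
    return subprocess.call(["fdtcp", src, dst])  # returns the exit code

# fdtcp_copy("serverIP1", "serverIP2", "/storage/controller1/file.dat")
```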
Here is the flow when PhEDEx issues a transfer from one storage server to another:
1. fdtcp on PhEDEx VM contacts the fdtcp daemons on the storage servers (via IP1 and IP2)
2. fdtcp on the destination storage server starts the FDT server (listening on all interfaces)
3. fdtcp on the source storage server starts the FDT client, instructing it to connect to IP2 (server)
4. transfer begins between the two servers
One issue that we encountered stemmed from the fact that the 40G NICs were on a private network and thus unreachable from the management network to which the PhEDEx VMs had access.
This meant that fdtcp could only issue transfers over the management interface. If we provided the private IPs in the fdtcp command, then as soon as fdtcp tried to contact the fdtcp daemons (as per step 1) it would fail, since those addresses were not reachable from the PhEDEx VM on which the command was issued.
In order to overcome this issue, fdtcp was modified so that it could receive commands on the management interface and issue transfers on the high-speed links (as per the figure below):
In this modified version:
1. fdtcp on PhEDEx VM contacts fdtcp daemons on the storage servers via the management interface
2. fdtcp on the destination storage server starts the FDT server
3. fdtcp on the source storage server starts the FDT client
4. transfer begins between the two servers on the high speed links
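The sketch below shows only the general idea behind this change: keep a table that maps each server's management address to its 40G data-plane address and consult it before starting FDT. The addresses, names and structure are assumptions for illustration and do not reproduce the actual fdtcp patch.

```python
# Sketch of the idea behind the modification (not the actual fdtcp patch):
# the daemon is contacted on the management address but hands the matching
# 40G data-plane address to the FDT client/server it starts.
# All addresses below are placeholders.

# management IP -> high-speed (private 40G) IP of the same storage server
DATA_PLANE = {
    "10.0.1.11": "192.168.100.11",   # Caltech 1
    "10.0.1.12": "192.168.100.12",   # Caltech 2
    "10.0.2.21": "192.168.200.21",   # iCAIR 1
}

def data_address(mgmt_ip):
    """Return the high-speed address to use for transfers to/from this server."""
    try:
        return DATA_PLANE[mgmt_ip]
    except KeyError:
        raise ValueError("no high-speed interface mapped for %s" % mgmt_ip)

# The source-side daemon would then start the FDT client against
# data_address(destination_mgmt_ip) instead of the management address.
```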
3) Global architecture of the PhEDEx installation
4) Dynamic bandwidth allocation
PhEDEx, through ANSE, is now able to make use of infrastructures which support dynamic circuits and bandwidth allocation.
For the SuperComputing demo, significant effort was put into adapting the software to handle not only the concept of circuits but also the concept of dynamic bandwidth.
This table highlights some of those differences:
Property | Circuits | Bandwidth |
IPs change when resource is allocated | yes | no |
Allocated bandwidth can be changed on the fly | no | yes |
Resource has a finite lifetime | yes | not necessarily |
The module which handles the lifecycle of such resources is called ResourceManager and it usually works at the site level. This means that it is currently used by the FileDownload site agent.
For the SuperComputing demo, we extracted this module and planned to use it directly, without any intervention from a PhEDEx agent, through the REST API that had already been put in place. By doing this we showed that the module can be successfully used independently of PhEDEx, by external software like PanDA.
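As a rough illustration of this mode of use, the sketch below shows how an external client could drive such a REST API; the host name, endpoint and parameter names are assumptions for this example and not the actual ResourceManager interface.

```python
# Illustration only: an external tool (e.g. PanDA) requesting bandwidth
# through a REST call. Host, endpoint and parameter names are assumptions.
import json
import urllib.request

def request_bandwidth(src_site, dst_site, gbps,
                      base="https://resourcemanager.example.org/api"):
    payload = json.dumps({"from": src_site, "to": dst_site,
                          "bandwidth_gbps": gbps}).encode()
    req = urllib.request.Request(base + "/allocate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)   # e.g. an identifier for the allocated resource

# request_bandwidth("T2_Caltech_1_1", "T2_iCAIR_1_1", 20)
```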
Unfortunately, due to time constraints, we could not run any transfers showing dynamic allocation of bandwidth.
Issues encountered
PhEDEx-specific issues
– A delay of up to 30 minutes between when a transfer command is issued and when the transfer actually starts limited our capability to troubleshoot errors. This delay stems from the various schedules on which the agents run, ultimately dictated by the fact that PhEDEx uses a single central database. We attempted to reduce this delay, but doing so eventually required a fairly major overhaul of PhEDEx, something that was not possible in the limited time we had to prepare this demo.
This, however, remains a point which CMS feels needs to be addressed in the future.
– Some of the error messages given by PhEDEx were either cryptic or non-existent.
For SC, when we first added the PhEDEx site node names into the database, we (incorrectly) used the SC_ prefix. We used the standard script provided by PhEDEx to add the nodes, without getting any errors. The website appeared to function correctly, showing all the nodes, the links between them and even the datasets associated with them. However, whenever we attempted to issue a transfer request between any nodes, a JavaScript error was thrown: “Node does not pass callback”. No other error was logged by the web service.
Realizing that the node names did not match a required regex, we switched to the usual T2_ node prefix. Although we initially provided incorrect node names, the system never complained until it was too late: the script that injected the nodes did not check the names for correctness, the database did not have a trigger to protect against such errors, and no exception was caught in the code when the error was thrown. The anomalous behaviour only surfaced when issuing requests.
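A simple guard in the injection path, like the sketch below, would have caught the bad prefix immediately. The regex only captures the convention that bit us (a T0_/T1_/T2_/T3_ prefix) and is not the exact rule PhEDEx enforces internally.

```python
# Illustrative node-name check; the pattern is an assumption, not the
# exact validation rule used inside PhEDEx.
import re

NODE_RE = re.compile(r"^T[0-3]_[A-Za-z0-9_]+$")

def check_node_name(name):
    if not NODE_RE.match(name):
        raise ValueError("invalid PhEDEx node name: %r" % name)

check_node_name("T2_Caltech_1_1")    # passes
# check_node_name("SC_Caltech_1_1")  # would have failed at injection time
```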
– After this incident, we faced a new issue. Although we reset the schema and added correctly named nodes in PhEDEx, when we attempted to create a transfer request, the box displaying destination sites did not show the updated names (they still appeared as SC_). We eventually discovered that the web service has its own cache, which can take up to 12 hours to expire and which is not cleared when the web service is restarted – it has to be erased manually.
FDTCP issues
– One of the important issues we discovered during the demo was already presented when we introduced FDTCP earlier in this article
– The Grid authentication module had to be removed for the purposes of this demo: it cut initial transfer delays significantly and removed the need to produce certificates for every machine fdtcp ran on
– Going from CentOS 6 to CentOS 7 was not straightforward. New RPMs had to be created.
Miscellaneous
– Chef didn’t always install everything on all the machines.
Specifically, on some machines it did not add the lines required by FDTCP in /etc/sudoers.
One or two machines were also excluded from the Chef configuration altogether (so FDTCP was not automatically installed on them).
– We used a cron job which removed files from the destination folder in order to keep the server from filling up. Oddly, this occasionally interfered with FDT just when it was about to finish a transfer (when it was renaming the temporary file it had written to its final name).
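A safer variant of that cleanup job is sketched below: it skips anything that may still be an in-flight FDT transfer. The destination path, the temporary-file suffix and the age threshold are assumptions made for this example.

```python
# Sketch of a safer cleanup job: free space in the destination folder but
# leave alone files that may still be in-flight FDT transfers.
# The path, temporary-file suffix and age threshold are assumptions.
import os
import time

DEST       = "/storage/controller1/incoming"  # hypothetical destination folder
TMP_SUFFIX = ".tmp"      # assumed naming of FDT's not-yet-renamed files
MIN_AGE    = 30 * 60     # only delete files untouched for 30 minutes

def cleanup(path=DEST):
    now = time.time()
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if not os.path.isfile(full) or name.endswith(TMP_SUFFIX):
            continue                      # never touch temporary files
        if now - os.path.getmtime(full) < MIN_AGE:
            continue                      # recently written, possibly in flight
        os.remove(full)

if __name__ == "__main__":
    cleanup()
```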
– Various connectivity issues:
o On one pair of servers, one of the NICs had some issues: ping worked, but nothing else.
o On a different pair of nodes the transfer was interrupted when one of the lambdas on the Padtecs was taken down.
o Some pairs of servers exhibited performance issues when doing disk-to-disk transfers. Throughput would initially be high, but it would eventually drop and stabilize at around 5 Gbps (on a 40 Gbps interface). This behaviour indicates that further work is needed on the CentOS 7 operating system in conjunction with the latest disk and network interface drivers.
[1] The TFC (Trivial File Catalog) maps logical file names (LFNs) to physical file names (PFNs), and vice versa. LFNs are what PhEDEx uses internally.
[2] Duration is calculated assuming a transfer rate of 40 Gbps (5 GB/s).