Global Petascale to Exascale Workflows for Data Intensive Science Accelerated
by Next Generation Programmable SDN Architectures and Machine Learning Applications

Submitted on behalf of the teams by: Harvey Newman, Caltech.

We will demonstrate several of the latest major advances in software defined and Terabit/sec networks, intelligent global operations and monitoring systems, workflow optimization methodologies with real-time analytics, and state of the art long distance data transfer methods, tools and server designs, to meet the challenges faced by leading edge data intensive experimental programs in high energy physics, astrophysics, climate and other fields of data intensive science. The key challenges being addressed include: (1) global data distribution, processing, access and analysis; (2) the coordinated use of massive but still limited computing, storage and network resources; and (3) coordinated operation and collaboration within global scientific enterprises, each encompassing hundreds to thousands of scientists.

The major programs being highlighted include the Large Hadron Collider (LHC), the Laser Interferometer Gravitational-Wave Observatory (LIGO), the Large Synoptic Survey Telescope (LSST), the Event Horizon Telescope (EHT) that recently released the first black hole image, and others.

Several of the SC19 demonstrations will include a fundamentally new concept of “consistent network operations,” in which stable, load balanced, high throughput workflows cross optimally chosen network paths, up to preset high water marks that leave room for other traffic. These paths are provided by autonomous site-resident services dynamically interacting with network-resident services, in response to demands from the science programs’ principal data distribution and management systems.
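The high-water-mark behavior described above can be sketched as a simple rate-allocation rule. This is an illustrative sketch only; the link capacity, flow names and the 0.8 high-water fraction are assumptions for illustration, not project parameters:

```python
# Illustrative sketch (not project code): pace a set of science flows sharing
# a link so their aggregate stays at or below a preset high-water mark,
# leaving headroom for other traffic on the same path.

def allocate_rates(link_capacity_gbps, flows, high_water=0.8):
    """Split the usable share of a link evenly among the active flows."""
    usable = link_capacity_gbps * high_water  # never exceed the high-water mark
    per_flow = usable / len(flows)
    return {flow: per_flow for flow in flows}

# Two hypothetical flows on a 100G link, capped at an 80 Gbps aggregate:
rates = allocate_rates(100, ["cms_transfer", "lsst_images"])
# → {"cms_transfer": 40.0, "lsst_images": 40.0}
```

In a real deployment the per-flow shares would be adjusted dynamically from monitoring feedback rather than split evenly, but the invariant is the same: the sum of paced rates never exceeds the high-water mark.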

Some of the cornerstone system concepts and components to be demonstrated include:

•      Integrated operations and orchestrated management of resources: absorbing and advancing the site resource manager (DTN-RM) and network resource manager (Network-RM) developed in the SENSE [9] program, edge SDN control, NDN routing and caching, transfer tools control, packet re-routing at the network level, and real-time resource control over site facilities and instruments.

•      Fine-grained end-to-end monitoring and data collection, with a focus on the edges and end sites, enabling intelligent, automatic, data analytics-assisted decisions driven by applications and supported by machine learning-based path selection and load balancing mechanisms.

•       An ontological model-driven framework with integration of an analytics engine, API and workflow orchestrator extending work in the SENSE project, enhanced by efficient multi-domain resource state abstractions and discovery mechanisms.

•        Adapting NDN for data intensive sciences, including advanced cache designs and algorithms, parallel code development, and methods for fast and efficient access over a global testbed, leveraging experience gained in the SDN Assisted NDN for Data Intensive Experiments (SANDIE; NSF CC*) project.

•      A paragon network at several of our partners’ sites composed of P4 programmable devices, including Tofino-based switches and Xilinx FPGA-based smart network interfaces providing packet-by-packet inspection, agile state tracking, real-time decisions and rapid reaction as needed.

•      High throughput platform demonstrations in support of workflows for the science programs mentioned. This will include reference designs of NVMeOF server systems to match a 400G network core, and comparative studies of servers with multi-GPUs and programmable smart NICs with FPGAs.

•      Integration of edge-focused extreme telemetry data (from P4 switches and end hosts) with end facility/application caching statistics and other metrics to facilitate automated decision-making.

•       Development of dynamic regional caches or “data lakes” that treat nearby sites as a unified data resource, building on the successful petabyte cache currently in operation between Caltech and UCSD based on the XRootD federated access protocol; extension of the cache concept to more remote sites such as KISTI and KASI in Korea, and TIFR (Mumbai); and application of the caches to support the LSST science use case and the use of PRP/TNRP distributed GPU clusters for machine learning and related applications.

•      Blending the above innovations with CMS petabyte regional caches and real-time joint-sky-survey analysis data services with a new level of end-to-end performance. This will also help define the near-future workflows, software systems and methods for these science programs.

•      System and application optimizations using the latest graphical representations and deep learning methods.
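As a toy illustration of the telemetry-driven, automated decisions described in the bullets above, the following sketch merges per-path utilization (as might come from P4 switch telemetry) with end-site cache statistics to select a path. All field names, path names and values are hypothetical:

```python
# Hypothetical decision step: combine edge telemetry with caching statistics
# and pick the least-loaded path, breaking ties toward warmer caches.
# Path names and metrics below are illustrative, not measured values.

def choose_path(paths):
    """paths: list of dicts with 'name', 'utilization' (0-1), 'cache_hit_rate' (0-1)."""
    # Sort key: prefer low utilization; among equals, prefer high cache hit rate.
    return min(paths, key=lambda p: (p["utilization"], -p["cache_hit_rate"]))["name"]

best = choose_path([
    {"name": "esnet_path",   "utilization": 0.7, "cache_hit_rate": 0.9},
    {"name": "amlight_path", "utilization": 0.4, "cache_hit_rate": 0.6},
])
# → "amlight_path"
```

A production system would feed many more telemetry dimensions into a trained model rather than a fixed rule, but the decision interface (metrics in, path choice out) is the same.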

This will be empowered by end-to-end SDN methods extending all the way to autoconfigured Data Transfer Nodes (DTNs), including intent-based networking APIs combined with transfer applications such as Caltech’s open source TCP-based FDT, which has been shown to match 100G long distance paths at wire speed in production networks. During the demos, the data flows will be steered across regional, continental and transoceanic wide area networks and automated through orchestration software and controllers such as the Automated GOLE (AutoGOLE), controlled through NSI and its MEICAN frontend, and the automated virtualization software stacks developed in the SENSE, PRP/TNRP and Chase-CI, AmLight, and other collaborating projects. The DTNs employed will use the latest high throughput SSDs and flow control methods at the edges such as FireQoS and/or Open vSwitch, complemented by NVMe over fabric installations in some locations.
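One reason tools like FDT can fill long, high-bandwidth paths is that they stream data over multiple parallel TCP connections. The following self-contained sketch demonstrates that multi-stream principle over the loopback interface; it illustrates the idea only and is not FDT's actual implementation:

```python
# Multi-stream TCP sketch (loopback demo): split a payload into chunks and
# send each chunk over its own TCP connection in parallel, then reassemble.
import socket
import threading

def recv_exact(conn, n):
    """Read exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        data = conn.recv(n - len(buf))
        if not data:
            raise ConnectionError("stream closed early")
        buf += data
    return buf

def serve_chunks(server_sock, results, n_streams):
    """Accept n_streams connections; each carries a 4-byte chunk index plus data."""
    def handle(conn):
        with conn:
            idx = int.from_bytes(recv_exact(conn, 4), "big")
            chunks = []
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                chunks.append(data)
            results[idx] = b"".join(chunks)
    threads = []
    for _ in range(n_streams):
        conn, _ = server_sock.accept()
        t = threading.Thread(target=handle, args=(conn,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

def send_parallel(payload, port, n_streams=4):
    """Send payload over n_streams parallel TCP connections."""
    chunk = (len(payload) + n_streams - 1) // n_streams
    def send(i):
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.sendall(i.to_bytes(4, "big"))
            s.sendall(payload[i * chunk:(i + 1) * chunk])
    threads = [threading.Thread(target=send, args=(i,)) for i in range(n_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Demo: transfer 1 MB over four parallel streams on localhost.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen()
port = srv.getsockname()[1]
results = {}
receiver = threading.Thread(target=serve_chunks, args=(srv, results, 4))
receiver.start()
send_parallel(b"x" * 1_000_000, port)
receiver.join()
srv.close()
reassembled = b"".join(results[i] for i in range(4))
```

Over real long-distance paths, parallel streams help because each TCP connection recovers from loss independently and the aggregate keeps the pipe full; production tools add pacing, retries and zero-copy I/O on top of this basic pattern.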

Elements and Goals of the Demonstrations

•  LHC: End to end workflows for large scale data distribution and analysis in support of the CMS experiment’s LHC workflow among Caltech, UCSD, LBL, Fermilab and GridUNESP (Sao Paulo) including automated flow steering, negotiation and DTN autoconfiguration; bursting of some of these workflows to the NERSC HPC facility and the cloud; use of unified caches to increase data access and processing efficiency. 

•  AmLight Express and Protect (AmLight-ExP) will be shown in support of the LSST and LHC-related use cases, in association with high-throughput low latency experiments and demonstrations of auto-recovery from network events, using optical spectrum on the new Monet submarine cable and its 100G ring network that interconnects the research and education communities in the U.S. and South America. For the LSST use case, representative real time, low latency transfers for scientific processing of multi-GByte images from the LSST/AURA site in La Serena, Chile are planned, flowing over the REUNA Chilean national network as well as the ANSP and RNP Brazilian national circuits, the AmLight Atlantic and Pacific rings, and Starlight to the conference site, using 300G of capacity between Miami and Sao Paulo, and 200G between Miami and the SC19 exhibit floor.

•  SENSE: The Software-defined network for End-to-end Networked Science at Exascale (SENSE) research project is building smart network services to accelerate scientific discovery in the era of ‘big data’ driven by Exascale, cloud computing, machine learning and AI. The SENSE SC19 demonstration showcases a comprehensive approach to requesting and provisioning end-to-end network services across domains, combining infrastructure deployed across multiple labs/campuses, SC booths and the WAN, with a focus on usability, performance and resilience through:

•     Intent-based, interactive, real time application interfaces providing intuitive access to intelligent SDN services for Virtual Organization (VO) services and managers;

•     Policy-guided end-to-end orchestration of network resources, coordinated with the science programs’ systems, to enable real time orchestration of computing and storage resources;

•     Auto-provisioning of network devices and Data Transfer Nodes (DTNs);

•     Real time network measurement, analytics and feedback to provide the foundation for full lifecycle status, problem resolution, resilience and coordination between the SENSE intelligent network services and the science programs’ system services;

•     Priority QoS for SENSE enabled flows;

•     Multi-point and point-to-point services.

•  Multi-Domain, Joint Path and Resource Representation and Orchestration (Mercator-NG): Fine-grained interdomain routing systems (e.g., SFP) and network resource discovery systems (e.g., Mercator) were designed to discover network path and resource information individually in collaborative science networks. Integrating such information is crucial for optimal science workflow orchestration, but is a non-trivial task due to the exponential number of possible path-resource combinations even in a single network. The Yale, IBM, ESNet and Caltech team will demonstrate Mercator-NG, the first multi-domain, joint path and resource discovery and representation system. Compared with the original Mercator system published and demonstrated at SC18, Mercator-NG provides two key novel features: (1) a fine-grained, compact linear algebra abstraction that jointly represents network path and resource information without enumerating the exponential number of paths in the network; and (2) an efficient orchestrator that optimizes science workflows using the collected network path and resource information. This demonstration will include: (1) efficient, extremely low latency discovery of available network path and resource information in a multi-domain wide-area collaborative science network spanning Los Angeles, Denver and New York; (2) optimal, online science workflow orchestration in this wide-area network; and (3) scaling to collaborative networks with hundreds of members.
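The linear-algebra idea can be illustrated with a toy example (an assumed rendering for illustration, not Mercator-NG's actual encoding): link capacities form a vector, a link-usage matrix maps each workflow to the links its provisioned route crosses, and checking an allocation becomes a matrix-vector product rather than an enumeration of all possible paths:

```python
# Toy linear-algebra resource representation. Capacities and routes are
# hypothetical; real systems would populate these from discovery protocols.

# capacities[l]: available bandwidth (Gbps) on link l
capacities = [100, 100, 40]

# usage[l][w] = 1 if workflow w's provisioned route crosses link l
usage = [
    [1, 0],   # link 0: used by workflow 0 only
    [1, 1],   # link 1: shared by both workflows
    [0, 1],   # link 2: used by workflow 1 only
]

def feasible(rates):
    """Check per-workflow rates fit every link: usage @ rates <= capacities."""
    for l, row in enumerate(usage):
        load = sum(a * r for a, r in zip(row, rates))
        if load > capacities[l]:
            return False
    return True

feasible([60, 30])   # True:  link 1 carries 90 <= 100, link 2 carries 30 <= 40
feasible([20, 50])   # False: link 2 would carry 50 > 40
```

Because the representation grows with the number of links and workflows rather than the number of paths, the same feasibility check scales to networks where explicit path enumeration would be exponential.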

•   Control Plane Composition Framework for Inter-domain Experiment Networks: This team will demonstrate Carbide, a novel control plane (CP) composition framework for inter-domain LHC experimental network deployment, achieving a collaborative network that spans both scientific and campus/domain-specific networks. The demonstration will include three key features of the framework. (1) High security: it composes different layers of CPs, each associated with a real-time, distributed verification model that guarantees the desired traffic policy is not violated. (2) High reliability: when the LHC CP crashes or a link fails, the framework can use an underlay CP as a backup without affecting any policies. (3) Flexibility: it allows (a) a partially specified CP for the LHC, and (b) modularity of the existing CP/network, so the LHC CP can be deployed in a plug-in, incremental manner. The virtual LHC CP is specified by the users, includes both intradomain and interdomain components, and can coexist with any existing/instantiated CP underlay.

•  NDN Assisted by SDN: Northeastern, Colorado State and Caltech will demonstrate Named Data Networking (NDN) based workflows, accelerated caching and analysis in support of the LHC and climate science programs, in association with the SANDIE (SDN Assisted NDN for Data Intensive Experiments) project. Specifically, we will demonstrate (1) increased throughput over the high-speed DPDK-based NDN network using our open source NDN-based XRootD plugin and the NDN producer, (2) a new caching implementation that supports multiple types of storage devices, (3) an extended testbed topology with an additional node at Northeastern University, and (4) an adaptive, optimized joint caching and forwarding algorithm over the SANDIE testbed.
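In NDN, a node's content store caches data packets under hierarchical names so that repeated Interests for the same name are served locally rather than from the producer. The following toy LRU content store sketches the idea; the names and eviction policy are illustrative, not SANDIE's cache design:

```python
# Toy NDN-style content store with LRU eviction. Names below are illustrative.
from collections import OrderedDict

class ContentStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # insertion order doubles as LRU order

    def insert(self, name, data):
        self.store[name] = data
        self.store.move_to_end(name)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used entry

    def lookup(self, name):
        if name in self.store:
            self.store.move_to_end(name)     # a hit refreshes LRU position
            return self.store[name]
        return None                          # miss: forward Interest upstream

cs = ContentStore(capacity=2)
cs.insert("/ndn/xrootd/cms/file1/seg0", b"...")
cs.insert("/ndn/xrootd/cms/file1/seg1", b"...")
cs.lookup("/ndn/xrootd/cms/file1/seg0")          # hit keeps seg0 warm
cs.insert("/ndn/xrootd/cms/file2/seg0", b"...")  # evicts seg1, not seg0
```

The SANDIE work goes well beyond this sketch (multiple storage tiers, joint caching and forwarding optimization), but the name-keyed store with an eviction policy is the common core.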

•  FPGA-accelerated Machine Learning Inference for LHC Trigger and Computing: UCSD and MIT will lead a group of collaborators demonstrating real-time FPGA-accelerated machine learning inference. Machine learning is used in many facets of LHC data processing, including the reconstruction of energy deposited by particles in the detector. The training of a neural network for this purpose with real LHC data will be demonstrated. The model deployment and acceleration on a Xilinx Alveo card using a custom compiler called hls4ml will also be shown. An equivalent setup utilizing an NVIDIA GPU will also be presented, allowing for direct comparison. This demonstration will serve to illustrate a first prototype of a new approach to real-time triggering and event selection with LHC data, aimed at meeting the challenges of the second phase of the LHC program, the High Luminosity LHC, which is planned to run from 2026 to 2037; further prototypes based on this approach will be developed during the upcoming LHC data taking runs in 2021-2023.

•  400GE First Data Networks: Caltech, Starlight/NRL, USC, SCinet/XNET, Ciena, Mellanox, Arista, Dell, 2CRSI, Echostreams, DDN and Pavilion Data, as well as other supporting optical, switch and server vendor partners, will demonstrate the first fully functional 3 X 400GE local ring network, as well as a 400GE wide area network ring linking the Starlight and Caltech booths with Starlight in Chicago. This network will integrate storage using NVMe over Fabric, the latest high throughput methods, in-depth monitoring and real-time flow steering. As part of this core set of demonstrations, we will make use of the latest DWDM and Waveserver Ai equipment, along with 400GE as well as 200GE switches and network interfaces from Arista, Dell, Mellanox and Juniper.


The partners will use approximately fifteen 100G and other wide area links coming into SC19, together with the available on-floor and DCI links to the Caltech and partner booths. An inner 1.2 Tbps (3 X 400GE) core network will be composed on the show floor, linking the Caltech, SCinet, Starlight and potentially other partner booths, in addition to several other booths each connected with 100G links, with Waveserver Ai and other data center interconnects and DWDM to SCinet. The network layout, highlighting the Caltech and Starlight booths, SCinet, and the many wide area network links to partners’ lab and university home sites, is shown in the accompanying diagram.
The SC19 optical DWDM installations in the Caltech booth and SCinet will build on this progress and incorporate the latest advances.


Physicists, network scientists and engineers from Caltech, Pacific Research Platform, Fermilab, FIU, UERJ, UNESP, Yale, Tongji, UCSD, UMaryland, LBL/NERSC, Argonne, KISTI, Michigan, USC, Northeastern, Colorado State, UCLA, TIFR (Mumbai), SCinet, ESNet, Internet2, StarLight, ICAIR/Northwestern, CENIC, Pacific Wave, Pacific Northwest GigaPop, AmLight, ANSP, RNP, REUNA, NetherLight, SURF, and their science and network partner teams, with support from Ciena, Intel, Dell, 2CRSI, Premio/Echostreams, Arista, Mellanox, and Pavilion Data, will take part in these demonstrations.