Wednesday, August 14 (Tutorials)
8:00-8:30 Breakfast and Registration

Break: 10:00-10:30

Tutorial 1

How to optimize and run MPI workloads on AWS with our Latest Services

Linda Hedges and Raghu Raja, Amazon Web Services

Tutorial 3

The CODES/TraceR Framework for Continuing Innovation of HPC Interconnects

Nikhil Jain, Nvidia; Neil McGlohon, Rensselaer Polytechnic Institute

12:00-13:30 Lunch

Break: 15:00-15:30

Tutorial 2

System Innovation in DCI Transport Networks

Loukas Paraschis and Abhinava Shivakumar Sadasivarao, Infinera

Tutorial 4

HPC meets distributed deep learning

D.K. Panda, Ammar Ahmed Awan, and Hari Subramoni, Ohio State University

Thursday, August 15 (Symposium)
8:00-8:45 Breakfast and Registration
8:45-9:00 Introduction
Don Draper & Eitan Zahavi, General Chairs
9:00-9:10 Host Opening Remarks
Mike Zeile, Intel
Session Chair:

From Microns to Miles - The Broad Spectrum of Intel's Interconnect Technology Strategy
Uri Cummings, CTO of DCG Connectivity Group, Intel
10:15-10:30 Morning Break
Session Chair:
  • The First Supercomputer with HyperX Topology: A Viable Alternative to Fat-Trees?

    The state-of-the-art topology for modern supercomputers is the Folded Clos network, a.k.a. the Fat-Tree. The node count in these massively parallel systems is steadily increasing. This forces an increased path length, which limits gains for latency-sensitive applications, because the port count of modern switches cannot be scaled accordingly. Additionally, the deployment methodology for today's Fat-Trees requires the extensive use of costly active optical cables. A novel, yet only theoretically investigated, alternative is the low-diameter HyperX. To perform a fair side-by-side comparison between a 3-level Fat-Tree and a 12x8 HyperX, we constructed the world's first 3 Pflop/s supercomputer with these two networks. We show through a variety of benchmarks that the HyperX, together with our novel communication pattern-aware routing, can challenge the performance of traditional Fat-Trees.

    Authors' affiliation:
    RIKEN Center for Computational Science (R-CCS), Japan
    Tokyo Institute of Technology*, Japan
    Hewlett Packard Enterprise (HPE)†, USA

    J. Domke, S. Matsuoka, I. R. Ivanov*, Y. Tsushima*, T. Yuki*, A. Nomura*, S. Miura*, N. McDonald†, D. L. Floyd†, and N. Dubé†
  • Path2SL: Optimizing Head-of-Line Blocking Reduction in InfiniBand-based Fat-tree Networks

    The interconnection network is a key element in high-performance computing (HPC) and Datacenter (DC) systems, as it must support the communication among the endnodes, whose number constantly increases. Hence, guaranteeing a suitable network performance is crucial, as otherwise the network would become the entire system bottleneck. Network performance depends on several design issues: topology and routing, switch architecture, interconnect technology, etc. Among the available interconnect technologies, InfiniBand is a prominent one. InfiniBand components and control software allow the implementation of efficient topologies and routing algorithms, as well as queuing schemes that reduce the Head-of-Line (HoL) blocking effect derived from congestion situations. In this paper we present a new queuing scheme called Path2SL, which optimizes the use of the InfiniBand Virtual Lanes (VLs) to reduce HoL blocking in Fat-Tree network topologies. We have implemented Path2SL in the control software of a real InfiniBand-based cluster. The experiment results obtained from real workloads run in this cluster show that Path2SL is a more efficient queuing scheme than others previously proposed to deal with HoL blocking in the analysed network configurations.

    Authors' affiliation:
    Universidad de Castilla-La Mancha, Spain

    G. Maglione-Mathey, J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, and J. Duato
  • High-Quality Fault-Resiliency in Fat-Tree Networks

    Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete re-routing of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM) first for routing execution time to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns. Results for Dmodc show A2A and RP congestion risks under heavy degradation similar to those of the most stable algorithms compared, and near-optimal SP congestion risk up to 1% of random degradation.

    Authors' affiliation:
    Versailles Saint-Quentin-en-Yvelines University (UVSQ), France
    Atos*, France
    Universidad de Castilla-La Mancha†, Spain

    J. Gliksberg, A. Capra*, A. Louvet*, P. J. Garcia†, and D. Sohier
12:00-13:00 Lunch
Session Chair:
  • Versal Network-on-Chip (NoC)

    Xilinx Versal Adaptable Compute Acceleration Platform (ACAP) is a new software-programmable heterogeneous compute platform. The slowing of Moore's law and the ever-present need for higher levels of compute performance have spurred the development of many domain-specific accelerator architectures. ACAP devices are well suited to take advantage of this trend. They provide a combination of hardened heterogeneous compute and IO elements and programmable logic. Programmable logic allows the accelerator to be customized in order to accelerate the whole application. The Versal Network-on-Chip (NoC) is a programmable resource that interconnects all of these elements. This paper outlines the motivation for a hardened NoC within a programmable accelerator platform and describes the Versal NoC.

    Authors' affiliation:
    Xilinx Inc, USA

    I. Swarbrick, D. Gaitonde, S. Ahmad, B. Jayadev, J. Cuppett, A. Morshed, B. Gaide, and Y. Arbel
  • Compute Express Link
    S. Van Doren
  • High Capacity On-Package Physical Link Considerations

    Multi-chiplet designs implement ASICs and other integrated products across multiple die within a single package. The ODSA group aims to define an open logical interface such that chiplets from multiple vendors can be composed to form domain-specific accelerators. As a part of this effort, the ODSA surveyed and analyzed a wide range of new inter-chiplet PHY technologies. This paper reports the results of the survey. We develop a framework to evaluate these PHY technologies. Based on our analysis, we propose the use of an abstraction layer so that multiple PHY technologies can present a common interface.

    Authors' affiliation:
    zGlue, USA
    Aquantia*, USA
    Netronome†, USA

    G. Taylor, R. Farjadrad*, and B. Vinnakota†
Invited Talks
Session Chair:

  • AI-engines for Real-Time Classification of Data Flows for Optimal Routing in IP Networks

    As more and more interconnect technologies are developed with a range of diverse capabilities, optimal workload management requires the intelligent matching of data traffic types with network capabilities on a dynamic basis. This, in turn, requires fast, real-time classification of data traffic into useful "buckets." This talk introduces the AI-based classification of data flows (as opposed to applications or "intent") by examining the first few IP packet headers. These AI-engines can be tuned differently in different parts of the network, or under different circumstances. The talk will cover use cases in data center networks as well as wireless or wireline carrier networks.

    Hus Tigli, Xaxar
  • Bio

    Hus Tigli founded Xaxar, his fifth start-up, in 2018 and serves as its Chairman and CEO. His previous four start-ups emerged as leaders in photonics, optical networking and mixed-signal ICs, and were either acquired by industry giants or had their technologies licensed to them.

    Prior to his entrepreneurial career starting in 2000, Tigli ran businesses with revenues from $5 million to $900 million at Raychem Corporation, a publicly traded innovator of materials-science based components for electronics and telecom markets.

    Tigli received his BS and MS in engineering at Columbia University and an MBA from Harvard.

  • Heterogeneous Compute Elasticity: Computing Beyond the Box with PCIe Networking

    The slowing down of Moore's law, coupled with the explosion of new storage- and compute-intensive applications like deep learning and data analytics, has created a severe challenge for data center networking in the core and at the edge. In the past, these problems were the domain of High Performance Computing but now enterprise, edge and cloud segments are all impacted. Existing legacy networking technologies like Ethernet, Fibre Channel and InfiniBand are unable to deliver the bandwidth and latency performance required especially. Meanwhile, PCIe, the ubiquitous connectivity solution for compute and storage inside a server, has remained trapped inside the box. Until now.

    GigaIO's FabreX is a PCIe standards-based network that addresses these challenges. In addition to providing unparalleled latency and bandwidth performance, the fabric can natively support NVMe-oF and GDR devices, removing the extra overhead associated with transferring data over another transport. We will present our S/W and H/W architecture and support for memory semantic capability for emerging SCM. Measured data from internal testing and from San Diego Supercomputer Center (SDSC) will be presented to demonstrate the performance and efficiency benefits of FabreX.

    Scott Taylor, GigaIO Networks
  • Bio

    Scott Taylor has an extensive background in high speed networking, accelerators and security from working at companies like Cray Research and Sun Microsystems. Leveraging this background, he created the FabreX software architecture supporting Redfish Composability Service, NVMe-oF, GPU Direct RDMA, accelerators, MPI and TCP/IP all with a single PCI-compliant interconnect. He has built the engineering team at GigaIO from the ground up to implement a singular vision of FabreX as an open source, standards-based ecosystem. Scott's previous experience includes Prisa Networks, a Fibre Channel startup, where he helped drive the shift from arbitrated loop to switch-based topologies. His many years working as an expert consultant help him drive key intellectual property development at GigaIO. Scott holds a BS in computer science from UC Santa Barbara.

15:10-15:25 Afternoon Break
Invited Talk
Session Chair:

Building Large Scale Data Centers: Cloud Network Design Best Practices

The talk examines the network design principles of large scale cloud networks that allow Cloud Service providers to achieve throughputs in excess of 10 Pbps in a single datacenter.

Andy Bechtolsheim, Arista Networks

As Chief Development Officer, Andy Bechtolsheim is responsible for the overall product development and technical direction of Arista Networks.

Previously Andy was a Founder and Chief System Architect at Sun Microsystems, where most recently he was responsible for industry standard server architecture. Andy was also a Founder and President of Granite Systems, a Gigabit Ethernet startup acquired by Cisco Systems in 1996. From 1996 until 2003 Andy served as VP/GM of the Gigabit Systems Business Unit at Cisco that developed the very successful Catalyst 4500 family of switches. Andy was also a Founder and President of Kealia, a next generation server company acquired by Sun in 2004.

Andy received an M.S. in Computer Engineering from Carnegie Mellon University in 1976 and was a Ph.D. student at Stanford University from 1977 until 1982. He was co-awarded the prestigious "EY 2015 Entrepreneur of the Year" across National USA.

Moderator: Uri Cummings, Intel

Data Center Transformation: How Will the Data Center Look Different in 5 Years?
  • How will system architecture change towards connectivity
  • Edge Networking
  • Scalable compute workloads in the cloud
  • Serverless computing and its impact on networking
  • I/O wall and shift towards optics
  • Competing technologies (CXL vs CCIX vs TileLink vs NVLink)

Andreas Bechtolsheim, Arista
Hong Liu, Google
Michael Kagan, Mellanox
Tom Tofigh, QCT

Friday, August 16 (Symposium)
8:15-9:00 Breakfast and Registration
Session Chair:

Rosetta: A 64-port Switch for Cray's Slingshot Interconnect
Steve Scott, Cray
10:00-10:30 Morning Break
Links and FPGAs
Session Chair:
  • Enabling Standalone FPGA Computing

    One of the key obstacles in the advancement of large-scale distributed FPGA platforms is the ability of the accelerator to act autonomously from the CPU, whilst maintaining tight coupling to system memory. This work details our efforts in decoupling the networking capabilities of the FPGA from CPU resources using a custom transport layer and network protocol. We highlight the reasons that previous solutions are insufficient for the requirements of HPC, and we show the performance benefits of offloading our transport into the FPGA fabric. Our results show promising throughput and latency benefits, and show competitive Flops being achievable for network dependent computing in a distributed environment.

    Authors' affiliation:
    The University of Manchester, UK

    J. Lant, J. Navaridas, A. Attwood, M. Lujan, and J. Goodacre
  • A Bunch of Wires (BoW) Interface for Inter-Chiplet Communication

    Multi-Chiplet system-in-package designs have recently received a lot of attention as a mechanism to combat high SoC design costs and to economically manufacture large ASICs. Multi-Chiplet designs require low-power area-efficient inter-Chiplet communication. Current technologies either extend on-chip high-wire count buses using silicon interposers or off-package serial buses over organic substrates. The former approach leads to expensive packaging. The latter to complex design. We propose a simple Bunch of Wires (BoW) interface that combines the ease of development of parallel interfaces with the low cost of organic substrates.

    Authors' affiliation:
    Aquantia, USA
    Netronome*, USA

    R. Farjadrad and B. Vinnakota*
  • Demonstration of a Single-Lane 80 Gbps PAM-4 Full-Duplex Serial Link

    Serial interface standards, such as Thunderbolt, SATA, USB and Ethernet, are constantly being upgraded to achieve higher speed bi-directional data connectivity for a wide range of applications. To meet this requirement, such links use multiple lanes, and therefore have to deal with near-end and far-end cross-talks, which makes their implementation challenging. In addition, multiple pairs of wires make the interface cables bulky, and the use of hybrid transformers for self-interference (SI) cancellation (as in the case of Ethernet links) makes their form factors unattractive.

    In this work, we propose full-duplexing using broadband SI cancellation techniques to demonstrate a single-lane bi-directional PAM-4 link, with an aggregate data rate of 80 Gbps. The demonstrated link does not require any hybrid transformer for SI-cancellation, and achieves a raw bit-error-rate (BER) of < 10^-9 over a 1 m long coaxial cable.

    Authors' affiliation:
    Indian Institute of Technology Bombay, India

    S. Goyal, P. Agarwal, and S. Gupta
12:00-13:30 Lunch
Session Chair:
  • Improved MPI Multi-threaded performance using OFI Scalable Endpoints

    Message Passing Interface (MPI) applications are launched as a set of parallel homogeneous processes, commonly with one to one mapping between MPI processes and compute cores. With the growing complexity of MPI applications and compute node processors consisting of large numbers of cores, launching a small number of MPI processes with several lightweight threads per process is becoming popular. Task based programming models in combination with MPI also provide several benefits for the application to exploit intra node parallelism. A naïve implementation of MPI_THREAD_MULTIPLE can be expensive with minimal or no performance benefits. We demonstrate a high-performance end to end multi-threading solution across the MPI application and MPI runtime, with threads mapping to hardware resources. We demonstrate our solution with Open MPI using Libfabric (a.k.a. OpenFabrics Interfaces OFI) and its Intel Omni-Path Performance Scaled Messaging 2 (PSM2) provider. Our tests with Intel® MPI Benchmarks Multi Thread set (IMB-MT) show BW improvement for large message sizes when running with multiple threads. We also demonstrate up to 2.5x performance improvements with Baidu All-Reduce. Even though the experiments were run on Intel® Omni-Path Architecture fabric, the solution can be applied to other fabrics with the capability of allocating resources among multiple threads.

    Authors' affiliation:
    Intel, USA

    A. Gopalakrishnan, M. Cabral, J. Erwin, and R. B. Ganapathi
  • Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter

    Amazon has recently announced a new network interface named Elastic Fabric Adapter (EFA) targeted towards tightly coupled HPC workloads. In this paper, we characterize the features, capabilities and performance of the adapter. We also explore how its transport models such as UD and SRD (Scalable Reliable Datagram) impact the design of high-performance MPI libraries. Our evaluations show that hardware level reliability provided by SRD can significantly improve the performance of MPI communication. We also propose a new zero-copy transfer mechanism over unreliable and orderless channels that can reduce the communication latency of large messages. The proposed design also shows significant improvement in collective and application performance against the vendor provided MPI library.

    Authors' affiliation:
    The Ohio State University, USA

    S. Chakraborty, S. Xu, H. Subramoni, and D. K. Panda
  • A Study of Network Congestion in Two Supercomputing High-Speed Interconnects

    Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.

    Authors' affiliation:
    University of Illinois at Urbana-Champaign, USA
    Sandia National Lab*, USA

    S. Jha, A. Patke, J. Brandt*, A. Gentile*, M. Showerman, E. Roman, Z. Kalbarczyk, B. Kramer, and R. K. Iyer
15:00-15:15 Afternoon Break
Efficient Network Design and Use
Session Chair:
  • Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects

    Heterogeneous HPC systems with GPUs are increasingly getting equipped with on-node interconnects like PCIe and NVLink and inter-node interconnects like InfiniBand and Omni-Path. However, the efficient exploitation of these interconnects brings forth many challenges for MPI+CUDA applications. Little exists in the literature that captures the impact of these interconnects on emerging application areas like distributed Deep Learning (DL). In this paper, we choose Horovod, a distributed training middleware, to analyze and profile high-level application workloads (e.g., training ResNet-50) instead of MPI microbenchmarks. It is challenging to use existing profilers like mpiP and nvprof as they only offer a black box approach and cannot profile emerging communication libraries like NCCL. To address this, we developed a profiler for Horovod that enables profiling of various communication primitives including MPI_Allreduce and ncclAllreduce for gradient exchange, as well as for Horovod's communication threads and response caches. We analyze the following metrics to gain insights into network-level performance on different interconnects: 1) Message size with tensor fusion, 2) Message size without tensor fusion, 3) Number of MPI and NCCL calls made for each message size, and 4) Time taken by each NCCL and/or MPI call. We also correlate these low-level statistics to higher level end-to-end training metrics like images per second. Three key insights we gained are: 1) Horovod tensor fusion offers slight performance gains (up to 5%) for CPU-based training on InfiniBand systems, 2) For GPU-based training, disabling tensor fusion improved performance (up to 17%) for GPUs connected with PCIe, and 3) The allreduce latency profiles show some extreme performance variations for non-power-of-two message sizes for both CPUs and GPUs on all interconnects when tensor fusion is enabled. To provide a comprehensive view of performance, we use a wide variety of systems with CPUs like Intel Skylake, AMD EPYC, and IBM POWER9, GPUs like Volta V100, and interconnects like PCIe, NVLink, InfiniBand, and Omni-Path.

    Authors' affiliation:
    The Ohio State University, USA

    A. A. Awan, A. Jain, C.-H. Chu, H. Subramoni, and D. K. Panda
  • Lightweight, Packet-centric Monitoring of Network Traffic and Congestion Implemented in P4

    Communication cost is an important factor for distributed applications running in data centers. To improve communication performance, developers need tools that enable them to measure and to understand how their application's communication patterns interact with the network, especially when those interactions result in congestion. This paper describes a lightweight sampling-based technique for monitoring communication that has a switch help a packet collect information about the path it takes from source to destination and congestion it encounters along the way. This scheme has essentially no bandwidth overhead, as it stores only a few bits of information in the header of a monitored IP packet, making it practical to monitor every packet. In our prior work, network simulations of large-scale tightly-coupled HPC applications showed this approach can provide detailed information about traffic and congestion that is useful for diagnosing the problem's root cause. Here, we describe an implementation of this scheme in P4 for data center networks and demonstrate its functionality with a basic experiment.

    Authors' affiliation:
    Rice University, USA

    P. Taffet and J. Mellor-Crummey
  • OmniXtend: Direct to Caches over Commodity Fabric

    There is a dearth of interfaces for efficient attachment of new kinds of non-volatile memory and purpose-built compute accelerators to processor pipelines. Early integrated microprocessors exposed an off-chip front-side bus to which discrete memory and peripheral controllers could attach in a standardized fashion. With the advent of symmetric multiprocessing and deep caches, this direct connection, together with memory controllers, has been implemented primarily using proprietary on-die technology. Proprietary interconnects and protocols hinder architectural innovation and are at odds with the open nature of the rapidly growing RISC-V movement.

    In this paper we introduce OmniXtend, a fully open coherence protocol meant to restore unrestricted interoperability of heterogeneous compute engines with a wide variety of memory and storage technologies. OmniXtend supports a four-hop MESI protocol and is designed to take advantage of a new wave of Ethernet switches with stateful and programmable data planes to facilitate system scalability. Ethernet transport was selected as a starting point for its ubiquity and historic resilience to reduce barriers to entry at modern bandwidths and latencies. Moreover, it allows us to build upon a vibrant ecosystem of hardware and IP, and to provide a boost to architectural innovation through the use of field-reconfigurable networking hardware. We briefly discuss the protocol operation and show performance measurements of the first ever NUMA RISC-V system prototype.

    Authors' affiliation:
    Western Digital, USA
    SiFive*, USA

    M. Radi, W. Terpstra*, P. Loewenstein, and D. Vucinic
  • Latency Critical Operation in Network Processors

    This paper presents the recent advancements made on the Advanced-IO-Processor (AIOP), a Network Processor (NPU) architecture designed by NXP Semiconductors. The base architecture consists of multi-tasking PowerPC processor cores combined with hardware accelerators for common packet processing functions. Each core is equipped with dedicated hardware for rapid task scheduling and switching on every hardware accelerator call, thus providing very high throughput. A hardware pre-emption controller snoops on the accelerator completions and sends task pre-emption requests to the cores. This reduces the latency of real-time tasks by quickly switching to the high priority task on the core without any performance penalty. A novel concept of priority-thresholding is further used to avoid latency uncertainty on lower priority tasks. The paper shows that these features make the AIOP architecture very effective in handling the conflicting requirements of high-throughput and low-latency for next-generation wireless applications like WiFi (802.11ax) and 5G. In the presence of frequent pre-emptions, the throughput reduces by only 3% on AIOP, compared to 25% on optimized present-day NPU architectures. Further, the absolute throughput and latency numbers are 2X better.

    Authors' affiliation:
    NXP Semiconductors, Netherlands

    S. Roy, A. Kaushik, R. Agrawal, J. Gergen, W. Rouwet, and J. Arends
17:35-17:50 Closing Remarks



Camera-ready submission deadline: June 26