
Tutorial 1

Title

How to Optimize and Run MPI Workloads on AWS with our Latest Services

Speakers

Linda S. Hedges and Raghu Raja (Amazon Web Services)

Abstract

Come to a hands-on workshop designed to explain the essentials of optimizing HPC applications on AWS. The tutorial starts with an introduction to common workloads run in the cloud and a discussion of the AWS services, instance types, storage, and networking options that target HPC workloads. A hands-on session then walks through setting up and running a common HPC workload that relies heavily on the network. Elastic Fabric Adapter (EFA) is an AWS network interface designed specifically for HPC applications that require high levels of inter-instance communication, such as computational fluid dynamics, weather modeling, and reservoir simulation. It uses a custom-built operating-system-bypass technique to enhance the performance of inter-instance communication, which is critical to scaling HPC applications. With EFA, HPC applications using popular HPC technologies like the Message Passing Interface (MPI) can scale to thousands of CPU cores. You'll learn how to use EFA to get maximum scalability for your workloads.
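
As a flavor of the hands-on portion, the sketch below shows the kind of tightly coupled MPI code whose scaling EFA targets: a minimal ring exchange in C. It is illustrative only, not the tutorial's actual lab material, and the launch command is an assumption, since the exact flags for selecting the EFA (libfabric) provider vary by MPI stack.

    /* ring.c - a minimal MPI ring exchange, the kind of tightly coupled
     * communication pattern that benefits from EFA's OS-bypass transport.
     * Build:  mpicc -O2 ring.c -o ring
     * Run (example; exact flags depend on your MPI/libfabric stack):
     *   FI_PROVIDER=efa mpirun -n 72 --hostfile hosts ./ring
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;         /* neighbor to send to   */
        int left  = (rank - 1 + size) % size;  /* neighbor to recv from */
        int token = rank, incoming = -1;

        double t0 = MPI_Wtime();
        /* Each rank passes a token around the ring; time here is
         * dominated by inter-instance communication, which EFA targets. */
        MPI_Sendrecv(&token, 1, MPI_INT, right, 0,
                     &incoming, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();

        printf("rank %d of %d: got %d from %d in %.1f us\n",
               rank, size, incoming, left, (t1 - t0) * 1e6);

        MPI_Finalize();
        return 0;
    }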

Bio

Linda Hedges

Linda Hedges is an HPC Principal Solutions Architect for Amazon Web Services. Her emphasis is on understanding the scalability of tightly coupled workloads. Over a career of more than 25 years, she has focused on state-of-the-art capabilities in Computational Fluid Dynamics analysis and automation. She has extensive management and project-management experience, rising to President of Stark Aerospace's engineering division, and held the position of Associate Technical Fellow while at the Boeing Company. While at Blue Origin, she developed CFD methods for rocket vehicle and propulsion development, including reacting hypersonic aerodynamics, combusting thrust chambers, complex turbomachinery, and full-vehicle flow simulations incorporating propulsive flow interactions. Linda Hedges obtained a PhD in Aeronautics and Astronautics from the University of Washington in 1991.

Raghu Raja

Raghu Raja is a Senior Engineer at Amazon Web Services, where he helps build technologies that enable customers to run their HPC applications more efficiently in the cloud. Prior to AWS, he was a Senior Engineer at Cray, conducting research on next-generation HPC storage technologies as part of an Advanced Development effort. Dr. Raja received his PhD from The Ohio State University. He has authored more than 20 scientific publications, presented talks at several international venues, and served on the Technical Program Committees of multiple HPC conferences and workshops.




Tutorial 2

Title

System Innovation in DCI Transport Networks

Speakers

Loukas Paraschis and Abhinava Shivakumar Sadasivarao (Infinera)

Abstract

Traffic interconnecting data centers (DCI) has grown more than any other type of transport-network traffic and is projected to grow by at least two more orders of magnitude. The economics of this growth motivated the building of dedicated DCI networks, with some of the most spectrally efficient fiber deployments globally. It also motivated a new class of purpose-built, DCI-optimized routing and optical transport systems. Hence, DCI has been the most significant evolution in transport networking this decade, and arguably since the earlier major transitions from TDM to IP/MPLS and WDM.

This tutorial reviews the most important DCI innovations and their increasingly important role in transport networks more generally. Notably, it reviews the main DCI network requirements and the associated optimizations: routers that focus on maximizing throughput rather than routing scale, and high-capacity, typically point-to-point WDM systems that have been the first to employ state-of-the-art coherent transmission. DCI has also pioneered in transport the extensive adoption of software innovations in automation, programmability, management abstraction, and control-plane disaggregation, typically referred to collectively as "SDN", along with the associated "open" transport architectures. Moreover, DCI is driving significant emerging innovations, including 400GE coherent WDM "ZR" pluggables in DCI routers and the potential value of network optimization and traffic engineering based on network analytics. We discuss the value of these innovations and the associated trade-offs, along with future research topics and related emerging standards.
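
As a rough illustration of the "ZR" pluggable data point above, the back-of-envelope sketch below computes the line rate of a dual-polarization 16QAM coherent signal. The symbol rate and overhead figures are illustrative assumptions, not values taken from the 400ZR specification.

    /* zr_capacity.c - back-of-envelope line-rate arithmetic for a coherent
     * "ZR"-class interface.  The numbers below are illustrative assumptions
     * (roughly DP-16QAM at ~60 Gbaud with ~15% FEC/framing overhead), not
     * values from any standard document.
     * Build: cc -O2 zr_capacity.c -o zr_capacity
     */
    #include <stdio.h>

    int main(void)
    {
        double baud          = 60e9;  /* symbol rate, symbols/s (assumed) */
        double bits_per_sym  = 4.0;   /* 16QAM: 4 bits per symbol         */
        double polarizations = 2.0;   /* dual-polarization transmission   */
        double fec_overhead  = 0.15;  /* assumed FEC + framing overhead   */

        double raw_rate = baud * bits_per_sym * polarizations;
        double payload  = raw_rate / (1.0 + fec_overhead);

        printf("raw line rate : %.0f Gb/s\n", raw_rate / 1e9);
        printf("payload rate  : %.0f Gb/s (~400GE after overhead)\n",
               payload / 1e9);
        return 0;
    }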

Bio

Loukas Paraschis

Loukas Paraschis has been Senior Director for cloud transport system engineering at Infinera since 2016. Before that, Loukas was Cisco's senior technology architect and business development manager for wireline transport in the global service provider segment, and earlier was a product line manager and technical leader in Cisco's routing and optical networking. Loukas graduated from Stanford University (PhD 1999, MS 1998), where he worked at the Information Systems and Networking Research laboratories. He has (co)authored more than 100 peer-reviewed publications, invited and tutorial presentations, and book chapters, and holds 5 patents. He has served in many IEEE and OSA leadership positions, including on the OFC and JOCN steering committees and as a JOCN associate editor, was an IEEE Photonics Society Distinguished Lecturer (2009), and is an OSA Fellow (2011). He was born in Athens, Greece, where he completed his undergraduate studies.

Abhinava Shivakumar Sadasivarao

Abhinava Shivakumar Sadasivarao is a Staff System Architect at Infinera. He has been with Infinera since 2012 and works in the Systems Architecture group, focusing on system requirements specification and software architecture for Infinera's DCI/Cloud platforms, the embedded network operating system, and security. In the past, he was involved in multiple first-of-a-kind proofs of concept of SDN applicability to optical transport, resulting in numerous vendor interoperability demonstrations. He completed his graduate studies at Carnegie Mellon University (MS '12) and has (co-)authored multiple peer-reviewed (and invited) publications at IEEE, ACM, and OSA conferences. Abhinava hails from the beautiful garden city of Bengaluru (India), where he completed his undergraduate studies.




Tutorial 3

Title

The CODES/TraceR Framework for Continuing Innovation of HPC Interconnects

Speakers

Nikhil Jain (Nvidia) and Neil McGlohon (Rensselaer Polytechnic Institute)

Abstract

With the frontier of exascale-level high-performance computing (HPC) upon us, it is becoming ever more crucial to obtain accurate and reliable predictions of prospective interconnect performance. The cost of building a new HPC system makes it risky to rely solely on analytical estimates and metrics. Full-scale simulation of network interconnects with a broad variety of workloads and configurations can grant crucial insight into the viability of prospective designs.

This tutorial will introduce CODES/TraceR, a flexible interconnect simulation framework built on top of the ROSS parallel discrete-event simulation (PDES) environment. We will present the capabilities of this framework, describing how they can be used to predict real-world interconnect viability and performance with minimal effort. Additionally, the tutorial will cover recent additions to the CODES framework, specifically support for Intel Scalable Workload Model (SWM) online workloads and Quality of Service features through traffic classes.

The tutorial will include a from-the-ground-up setup and execution walkthrough and present case studies of recent work showing how the framework can be used to drive innovation in HPC system interconnects.
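
For context on what ROSS-style discrete-event simulation means, the toy C sketch below processes timestamped events in time order from a single queue. It is a conceptual illustration only and deliberately does not use (or resemble) the actual CODES/ROSS APIs, which the tutorial itself covers.

    /* des_sketch.c - a toy *sequential* discrete-event loop illustrating
     * the core idea behind PDES frameworks like ROSS: events carry
     * timestamps and are processed in time order.  Conceptual sketch only.
     * Build: cc -O2 des_sketch.c -o des_sketch
     */
    #include <stdio.h>

    #define MAX_EVENTS 64

    typedef struct { double time; int src, dst; } Event;

    static Event queue[MAX_EVENTS];
    static int   nevents = 0;

    static void schedule(double time, int src, int dst)
    {
        queue[nevents++] = (Event){ time, src, dst };
    }

    /* Remove and return the event with the smallest timestamp. */
    static Event pop_earliest(void)
    {
        int min = 0;
        for (int i = 1; i < nevents; i++)
            if (queue[i].time < queue[min].time) min = i;
        Event e = queue[min];
        queue[min] = queue[--nevents];
        return e;
    }

    int main(void)
    {
        const double link_latency = 1.0;  /* assumed per-hop delay, us */

        schedule(0.0, 0, 1);  /* node 0 sends to node 1 at t = 0.0 */
        schedule(0.5, 2, 1);  /* node 2 sends to node 1 at t = 0.5 */

        while (nevents > 0) {
            Event e = pop_earliest();
            printf("t=%.1f: packet %d -> %d arrives at t=%.1f\n",
                   e.time, e.src, e.dst, e.time + link_latency);
        }
        return 0;
    }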

Bio

Nikhil Jain

Nikhil Jain is a Computer Architect at Nvidia interested in the design and deployment of optimized parallel systems. He currently works on hardware and software performance optimizations, GPU architecture, interconnect networks, and scalable application development. Nikhil received a Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2016, and B.Tech. and M.Tech. degrees in Computer Science and Engineering from I.I.T. Kanpur, India in May 2009. He is a recipient of the LLNL Sidney Fernbach postdoctoral fellowship, the IBM PhD fellowship, and the UoI Andrew and Shana Laursen fellowship. Prior to Nvidia, he worked at Lawrence Livermore National Laboratory and IBM Research.

Neil McGlohon

Neil McGlohon is a PhD student at Rensselaer Polytechnic Institute. His research focuses on the simulation and evaluation of HPC communication networks, and he is an active maintainer of the CODES network simulation framework. In addition to his own studies, his work contributes to the Department of Energy's Exascale Computing Project.


Tutorial 4

Title

HPC Meets Distributed Deep Learning

Speakers

D.K. Panda, Ammar Ahmed Awan, and Hari Subramoni (Ohio State University)

Abstract

The recent advances in Deep Learning (DL) have led to many exciting challenges and opportunities for CS and AI researchers alike. Modern DL frameworks like TensorFlow, PyTorch, and several others offer ease of use and the flexibility to train and deploy various types of Deep Neural Networks (DNNs). In this tutorial, we will provide an overview of interesting trends in DNN design and of how cutting-edge hardware architectures and high-performance interconnects are playing a key role in moving the field forward. We will also present an overview of different DNN architectures and DL frameworks. Most DL frameworks started with a single-node design; however, approaches to parallelize the process of DNN training are being actively explored. The DL community has moved through different distributed-training designs that exploit communication runtimes like gRPC, MPI, and NCCL. We highlight new challenges and opportunities for communication runtimes to exploit high-performance interconnects and efficiently support large-scale distributed DNN training. We also highlight some of our co-design efforts to utilize CUDA-Aware MPI for large-scale DNN training on GPU clusters. Finally, we include hands-on exercises to give attendees first-hand experience running distributed DNN training experiments on a modern GPU cluster.
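
The communication step at the heart of the MPI-based designs mentioned above is a gradient allreduce after each local backward pass. The minimal C sketch below shows that step with a dummy gradient buffer; real frameworks run the same collective over CUDA-aware MPI or NCCL on GPU memory.

    /* allreduce_grads.c - the core communication step of data-parallel DNN
     * training: after each local backward pass, every rank averages its
     * gradients with all others.  Minimal sketch with a dummy buffer.
     * Build:  mpicc -O2 allreduce_grads.c -o allreduce_grads
     * Run:    mpirun -n 4 ./allreduce_grads
     */
    #include <mpi.h>
    #include <stdio.h>

    #define NGRADS 4  /* tiny stand-in for millions of DNN parameters */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Pretend each rank computed gradients on its own minibatch shard. */
        float grads[NGRADS];
        for (int i = 0; i < NGRADS; i++)
            grads[i] = (float)(rank + 1) * (i + 1);

        /* Sum gradients across all ranks in place, then divide to average. */
        MPI_Allreduce(MPI_IN_PLACE, grads, NGRADS, MPI_FLOAT,
                      MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < NGRADS; i++)
            grads[i] /= (float)size;

        if (rank == 0) {
            printf("averaged gradients:");
            for (int i = 0; i < NGRADS; i++) printf(" %.2f", grads[i]);
            printf("\n");
        }

        MPI_Finalize();
        return 0;
    }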

Bio

D.K. Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,000 organizations worldwide (in 89 countries). More than 555,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 5th, 8th, 16th, and 19th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 315 organizations in 35 countries. More than 30,650 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow frameworks are available from https://hidl.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

Ammar Ahmad Awan

Ammar Ahmad Awan received his B.S. and M.S. degrees in Computer Science and Engineering from the National University of Science and Technology (NUST), Pakistan and Kyung Hee University (KHU), South Korea, respectively. Currently, Ammar is working towards his Ph.D. degree in Computer Science and Engineering at The Ohio State University. His current research focus lies at the intersection of High Performance Computing (HPC) libraries and Deep Learning (DL) frameworks. He previously worked on a Java-based Message Passing Interface (MPI) and on nested parallelism with OpenMP and MPI for scientific applications. He has published 20 papers in conferences and journals related to these research areas. He actively contributes to various projects like MVAPICH2-GDR (High Performance MPI for GPU clusters), OMB (OSU Micro Benchmarks), and HiDL (High Performance Deep Learning). He is the lead author of the OSU-Caffe framework (part of the HiDL project), which allows efficient distributed training of Deep Neural Networks. More details about Ammar are available at http://www.cse.ohio-state.edu/~awan.10.

Hari Subramoni

Dr. Hari Subramoni received the Ph.D. degree in Computer Science from The Ohio State University, Columbus, OH, in 2013. He has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University since September 2015. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas. Recently, Dr. Subramoni has been doing research and working on the design and development of the MVAPICH2, MVAPICH2-GDR, and MVAPICH2-X software packages. He is a member of IEEE. More details about Dr. Subramoni are available from http://www.cse.ohio-state.edu/~subramon.


