Tutorial 1

Title

Exploiting High-Performance Interconnects to Accelerate Big Data Processing with Hadoop, Spark, Memcached, and gRPC/TensorFlow

Speakers

Dhabaleswar K. (DK) Panda and Xiaoyi Lu (Ohio State University)

Abstract

The convergence of HPC, Big Data, and Deep Learning is the next game-changing business opportunity. Apache Hadoop, Spark, gRPC/TensorFlow, and Memcached are becoming standard building blocks for Big Data processing and mining. Recent studies have shown that the default designs of these components cannot efficiently leverage the features of modern HPC clusters, such as Remote Direct Memory Access (RDMA)-enabled high-performance interconnects, high-throughput parallel storage systems (e.g., Lustre), and Non-Volatile Memory (NVM). In this tutorial, we will provide an in-depth overview of the architecture of Hadoop, Spark, gRPC/TensorFlow, and Memcached. We will examine the challenges in re-designing the networking and I/O components of these middleware systems for modern interconnects, protocols (such as InfiniBand and RoCE), and storage architectures. Using the publicly available software packages from the High-Performance Big Data project (HiBD, http://hibd.cse.ohio-state.edu), we will provide case studies of the new designs for several Hadoop/Spark/gRPC/TensorFlow/Memcached components and their associated benefits. Through these, we will also examine the interplay between high-performance interconnects, storage (HDD, NVM, and SSD), and multi-core platforms to achieve the best solutions for these components and applications on modern HPC clusters. We will also present in-depth case studies with modern Deep Learning tools (e.g., Caffe, TensorFlow, DL4J, and BigDL) running over RDMA-enabled Hadoop, Spark, and gRPC.
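For a flavor of what acceleration beneath an unchanged client API means in practice, here is a minimal sketch of a standard libmemcached C client. It assumes an RDMA-capable Memcached server, such as one built from the HiBD packages, that preserves the stock client interface; the hostname and key/value strings are illustrative placeholders.

```c
/* Minimal libmemcached client sketch. Assumes an RDMA-capable
 * Memcached server (e.g., from the HiBD packages) that keeps the
 * standard client API; "node01" and the key/value are placeholders. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "node01", 11211);  /* hypothetical host */

    const char *key = "hibd:example";
    const char *val = "hello-rdma";
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          val, strlen(val),
                                          (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len = 0;
    uint32_t flags = 0;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (out != NULL) {
        printf("got: %.*s\n", (int)len, out);
        free(out);
    }
    memcached_free(memc);
    return 0;
}
```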

Bio

Dhabaleswar K. (DK) Panda

Dhabaleswar K. (DK) Panda is a Professor of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high performance computing, communication protocols, file systems, network-based computing, and Quality of Service. He has published over 350 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies, including InfiniBand, HSE, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10GigE/iWARP companies on designing various subsystems of next-generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP, and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,400 organizations worldwide (in 75 countries). This software has enabled several InfiniBand clusters (including the 7th-ranked one) to get into the latest TOP500 ranking. These software packages are also available with the OpenFabrics stack for network vendors (InfiniBand and iWARP), server vendors, and Linux distributors. The new RDMA-enabled Apache Hadoop and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, and Memcached, along with support for clusters with Lustre file systems, are publicly available from http://hibd.cse.ohio-state.edu. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners, including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available at http://web.cse.ohio-state.edu/~panda.2/.

Xiaoyi Lu

Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. He obtained his Ph.D. degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests include high-performance interconnects and protocols, Big Data, the Hadoop/Spark ecosystem, parallel computing models (MPI/PGAS), GPU/MIC, virtualization, and cloud computing. He has published over 60 papers in major journals and international conferences related to these research areas. He has been actively involved in various professional activities for academic journals and conferences. Dr. Lu is currently doing research and working on the design and development of the High-Performance Big Data project (http://hibd.cse.ohio-state.edu). He is a member of IEEE. More details about Dr. Lu are available at http://web.cse.ohio-state.edu/~lu.932/.






Tutorial 2

Title

High Performance Distributed Deep Learning for Dummies

Speakers

Dhabaleswar K. Panda, Ammar Ahmad Awan, and Hari Subramoni (The Ohio State University)

Abstract

The current wave of advances in Deep Learning (DL) has led to many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks such as Caffe/Caffe2, TensorFlow, CNTK, and Torch offer ease of use and the flexibility to describe, train, and deploy various types of Deep Neural Networks (DNNs), including deep convolutional nets. In this tutorial, we will provide an overview of interesting trends in DL and how cutting-edge hardware architectures are playing a key role in moving the field forward. We will also present an overview of DL frameworks from both an architectural and a performance standpoint. Most DL frameworks use a single GPU to accelerate DNN training and inference, but approaches to parallelizing training are being actively explored, and the DL community has also moved toward MPI-based parallel/distributed training. We will therefore highlight the new challenges MPI runtimes face in efficiently supporting DNN training, and how we have designed efficient communication primitives in MVAPICH2 to support scalable DNN training. Finally, we will discuss how co-design of the OSU-Caffe framework and the MVAPICH2 runtime enables DNN training to scale out to 160 GPUs.
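As a concrete illustration of the MPI-based training pattern mentioned above, the sketch below shows data-parallel gradient aggregation with MPI_Allreduce: every rank computes gradients on its own mini-batch, the allreduce sums them, and each rank applies the same averaged update. This is a generic sketch, not OSU-Caffe's implementation; compute_gradients and apply_update are hypothetical stand-ins and the model size is arbitrary.

```c
/* Data-parallel training step (sketch, not OSU-Caffe's code). */
#include <mpi.h>
#include <stdlib.h>

#define NPARAMS (1 << 20)   /* illustrative model size */

static void compute_gradients(float *g, int n) {   /* hypothetical stand-in */
    for (int i = 0; i < n; i++) g[i] = 1.0f;
}
static void apply_update(const float *g, int n) {  /* hypothetical stand-in */
    (void)g; (void)n;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *grad = malloc(NPARAMS * sizeof *grad);
    compute_gradients(grad, NPARAMS);            /* local mini-batch */

    /* Sum gradients across all ranks; a CUDA-aware MPI such as
     * MVAPICH2-GDR could accept GPU buffers here directly. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        grad[i] /= size;                         /* average across ranks */

    apply_update(grad, NPARAMS);
    free(grad);
    MPI_Finalize();
    return 0;
}
```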

Bio

Dhabaleswar K. (DK) Panda

See Dr. Panda's bio under Tutorial 1.

Ammar Ahmad Awan

Ammar Ahmad Awan received his B.S. and M.S. degrees in Computer Science and Engineering from the National University of Science and Technology (NUST), Pakistan, and Kyung Hee University (KHU), South Korea, respectively. He is currently working towards his Ph.D. degree in Computer Science and Engineering at The Ohio State University. His current research focus lies at the intersection of High Performance Computing (HPC) libraries and Deep Learning (DL) frameworks. He previously worked on a Java-based Message Passing Interface (MPI) implementation and investigated nested parallelism with OpenMP and MPI for scientific applications. He has published 14 papers in conferences and journals related to these research areas, and his past work won the Best Paper Runner-up award at EuroMPI 2016. He actively contributes to various projects, including MVAPICH2-GDR (High-Performance MPI for GPU Clusters), OMB (OSU Micro-Benchmarks), and HiDL (High-Performance Deep Learning). He is the lead author of the OSU-Caffe framework (part of the HiDL project), which allows efficient distributed training of Deep Neural Networks. More details are available at http://web.cse.ohio-state.edu/~awan.10

Hari Subramoni

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities for academic journals and conferences. Dr. Subramoni is doing research on the design and development of the MVAPICH2 (High-Performance MPI over InfiniBand, iWARP, and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC, and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available from http://www.cse.ohio-state.edu/~subramon.




Tutorial 3

Title

Designing and Developing Performance Portable Network Codes

Speakers

Pavel Shamis (ARM) and Yossi Itigin (Mellanox Technologies)

Abstract

Developing high-performing, portable communication libraries that cater to different programming paradigms and different architectures is a challenging and resource-consuming task. For performance, these libraries have to leverage the rich capabilities provided by modern network interconnects, which often requires a deep understanding of a wide range of network features, of the interplay between software and hardware components, and of their impact on the library implementation and on application communication characteristics. In this tutorial, we will provide insights into the challenges and complexity of developing such communication libraries. We will share our experience developing Unified Communication X (UCX), a framework of network APIs and implementations for compute- and data-intensive scientific computing.
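For a sense of what programming against UCX looks like, the minimal sketch below bootstraps the high-level UCP layer: it creates a context and a worker, the two objects an application needs before it can create endpoints and communicate. The calls follow the public ucp.h API; the feature selection (tag matching) and the single-threaded worker mode are illustrative choices, and error handling is abbreviated.

```c
/* Minimal UCP bootstrap sketch: context + worker creation. */
#include <ucp/api/ucp.h>
#include <stdlib.h>

int main(void) {
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG,      /* tag-matching messaging */
    };
    ucp_config_t *config;
    ucp_context_h context;
    ucp_worker_h  worker;

    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return EXIT_FAILURE;
    if (ucp_init(&params, config, &context) != UCS_OK)
        return EXIT_FAILURE;
    ucp_config_release(config);

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
        return EXIT_FAILURE;

    /* ... exchange worker addresses, create endpoints, communicate ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```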

Bio

Pavel Shamis

Pavel Shamis is a principal research engineer at ARM. His research interests include high-performance communication networks, communication middleware, and programming models. Prior to joining ARM, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development on multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was responsible for developing the HPC software stack, including the OFA software stack, Open MPI, MVAPICH, OpenSHMEM, and others.

Pavel earned his Master of Computer Science degree from Colorado State University and his B.Sc. degree in Education in Technology and Computer Science from the Technion, Israel Institute of Technology. Pavel is a recipient of an R&D 100 award for his contribution to the development of the CORE-Direct collective offload technology.

Yossi Itigin

Yossi Itigin is a principal software engineer and team leader in the HPC group at Mellanox Technologies. For the past six years, he has been the lead developer and maintainer of the MXM and OpenUCX projects, focusing on highly optimized messaging solutions for modern clusters. Prior to joining Mellanox, he worked at Voltaire for four years, during which time he developed the first generation of the collective accelerator (FCA). Yossi earned his B.Sc. in Computer Science from Tel Aviv University in Israel.


Tutorial 4

Title

Developing with Open Fabrics Interfaces' libfabric

Speakers

Sean Hefty (Intel) and James Swaro (Cray)

Abstract

Open Fabrics Interfaces, or OFI, is a network-agnostic framework that exports fabric communication services to applications. OFI is designed to meet the performance and scalability requirements of HPC applications, such as MPI, SHMEM, PGAS, DBMS, and enterprise applications, running in a tightly coupled network environment. Libfabric is a core component of OFI: it is the library that defines and exports the user-space API of OFI, and it is typically the only software that applications deal with directly. The goal of libfabric is to define interfaces that enable a tight semantic map between applications and the underlying fabric services. This tutorial describes the libfabric architecture and interfaces, with the aim of instructing developers on how the features of libfabric may best be employed.
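To illustrate the semantic-map idea, the sketch below walks through libfabric's discovery flow: the application describes the endpoint type and capabilities it needs in a hints structure, fi_getinfo returns matching providers, and fabric and domain objects are opened from the result. The requested capabilities and API version are illustrative choices, not requirements of any particular provider, and error handling is abbreviated.

```c
/* Minimal libfabric discovery sketch: hints -> fi_getinfo -> fabric/domain. */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <stdio.h>

int main(void) {
    struct fi_info *hints = fi_allocinfo(), *info;
    hints->ep_attr->type = FI_EP_RDM;          /* reliable, unconnected */
    hints->caps          = FI_TAGGED | FI_MSG; /* tagged + message queues */

    /* FI_VERSION(1, 5) is an illustrative API version choice. */
    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info)) {
        fprintf(stderr, "no matching fabric found\n");
        return 1;
    }
    printf("selected provider: %s\n", info->fabric_attr->prov_name);

    struct fid_fabric *fabric;
    struct fid_domain *domain;
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);

    /* ... create endpoint and completion queues, then communicate ... */

    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```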

Bio

Sean Hefty

Sean Hefty has 24 years of industry experience at Intel, focused on high-performance networking. He was involved in InfiniBand at its inception and is a long-time contributor to and maintainer of OpenFabrics software for both Linux and Windows systems. He is the lead architect and maintainer of Open Fabrics Interfaces' libfabric. In addition to working at Intel, he taught at Oregon Tech as adjunct faculty for 15 years, focusing on database systems, computer networking, and information technology. He holds advanced degrees in computer science and mathematics, and plays kazoo in a heavy metal band.

James Swaro

James Swaro is a software engineer at Cray Inc., focused on high-performance networking. For the past two years, he has been a contributor to Open Fabrics Interfaces' libfabric. James earned his M.S. degree in Computer Science from Ohio University.


Tutorial 5

Title

The TraceR/CODES Framework for Application Simulations on HPC Networks

Speakers

Nikhil Jain (Lawrence Livermore National Laboratory) and Misbah Mubarak (Argonne National Laboratory)

Abstract

The design space exploration and procurement process for next-generation high performance computing (HPC) systems is often guided by the expected performance and cost tradeoffs of various alternative options. The increasing complexity of today's HPC architectures hurts the prediction accuracy of simple models and thus necessitates detailed simulations. In this tutorial, we focus on simulating the network, a major component of HPC systems, and discuss factors that impact the simulation of realistic scenarios. We will introduce the TraceR/CODES simulation framework, which has been developed to facilitate studies of application performance on future networks. We will present the capabilities of this framework and describe how they can be used to mimic real-world scenarios in simulations. In particular, we will discuss how production applications and their multi-job workloads, with customized job placement schemes, can be simulated with minimal effort. The tutorial will also touch upon the installation process, usage guidelines, and the community software used by the framework. Finally, we will present case studies from recent work that illustrate how the TraceR/CODES framework can be used to conduct interesting design space explorations and procurement studies.

Bio

Nikhil Jain

Nikhil Jain is a Sidney Fernbach postdoctoral fellow in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. He works on topics related to parallel computing, including networks, scalable application development, parallel algorithms, communication optimization, and the interoperation of languages. Nikhil received a Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2016, and B.Tech. and M.Tech. degrees in Computer Science and Engineering from IIT Kanpur, India, in May 2009. He was awarded the IBM Ph.D. fellowship in 2014 and the Andrew and Shana Laursen fellowship in 2011.

Misbah Mubarak

Misbah Mubarak is a postdoctoral researcher in the Mathematics and Computer Science division at Argonne National Laboratory. At Argonne, she is part of the data-intensive science group that helps researchers make use of their big data on high performance computing systems. Misbah received her Ph.D. and master's degrees in computer science from Rensselaer Polytechnic Institute (RPI) in 2015 and 2011, respectively. She also has experience working at CERN, Switzerland, and at Teradata Corporation. She is the recipient of a U.S. Fulbright scholarship and the ACM SIGSIM PADS Ph.D. colloquium award, and was a finalist for the Google Anita Borg scholarship.


