Wednesday, August 24 (Symposium)

8:00-8:45 | Breakfast and Registration

8:45-9:00 | Introduction
Ryan Grant & Charlie Perkins, General Chairs
James Dinan & Ricki Williams, Technical Program Chairs

9:00-9:10 | Host Opening Remarks
Mike McBride, Huawei

9:10-10:15 | Keynote
Session Chair: Ryan Grant

Cloudcasting - Perspectives on Virtual Routing for Cloud Centric Network Architectures
Kiran Makhijani, Huawei

10:15-10:30 | Morning Break

10:30-12:00 | Routing and Network Topology
Session Chair: Edgar A. Leon

Ensuring Deadlock-Freedom in Low-Diameter InfiniBand Networks
Abstract
Lossless networks such as InfiniBand use flow control to avoid packet loss due to congestion. This introduces dependencies between input and output channels; in the case of
cyclic dependencies, the network can deadlock. Deadlocks can be resolved by splitting a physical channel into multiple virtual channels with independent buffers and credit systems.
Currently available routing engines for InfiniBand assign entire paths from source to destination nodes to different virtual channels. However, InfiniBand allows changing the virtual
channel at every switch. We developed fast routing engines that make use of that fact and map individual hops to virtual channels. Our algorithm imposes a total order on virtual
channels and increments the virtual channel at every hop, so the diameter of the network is an upper bound on the required number of virtual channels. We integrated this
algorithm into the InfiniBand software stack. Our algorithms provide deadlock-free routing on state-of-the-art low-diameter topologies, using fewer virtual channels than currently
available practical approaches, while being faster by a factor of four on large networks. Since low-diameter topologies are common among the largest supercomputers in the world,
providing deadlock-free routing for such systems is very important.
Authors affiliation: ETH Zurich, Switzerland
T. Schneider, O. Bibartiu and T. Hoefler
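
The hop-to-virtual-channel mapping described in this abstract can be illustrated with a short sketch. This is a minimal approximation of the idea (increment the VC at every hop under a total order), not the authors' routing engine; the path representation and function name are assumptions.

```python
def assign_virtual_channels(path_hops, num_vcs):
    """Map each hop of a path to a virtual channel (VC).

    Imposing a total order on VCs and incrementing the VC at every hop breaks
    cyclic channel dependencies: a packet holding VC i can only wait for VC i+1,
    so no dependency cycle can form. The path length (bounded by the network
    diameter) therefore bounds the number of VCs needed.
    """
    if len(path_hops) > num_vcs:
        raise ValueError("need at least as many VCs as hops (network diameter)")
    # Hop k of every path travels on VC k.
    return [(hop, vc) for vc, hop in enumerate(path_hops)]

# Example: a 3-hop path A -> B -> C -> D on a diameter-3 topology
# needs at most 3 virtual channels.
path = [("A", "B"), ("B", "C"), ("C", "D")]
print(assign_virtual_channels(path, num_vcs=3))
# [(('A', 'B'), 0), (('B', 'C'), 1), (('C', 'D'), 2)]
```
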
Scalable, Global, Optimal-bandwidth, Application-Specific Routing
Abstract
High performance computing platforms can benefit from additional bandwidth from the interconnection network because there are many applications with significant communication demands. Further,
many HPC applications expressed as MPI programs have stable communication patterns across runs. Ideally, one would like to exploit the stable communication patterns by using global routing of
communication paths to minimize network contention. Unfortunately, existing optimal-bandwidth, global routing techniques use mixed integer linear programs which fundamentally do not scale to the
sizes that HPC workloads and platforms demand. Consequently, HPC platforms use simple distributed routing techniques, possibly with local adaptive routing, at best. Our design – Scalable
Global Routing (SGR) – addresses this gap. Simulations reveal that in a 4096-node, 4D-torus network, SGR achieves global route computation with a speedup of nearly two orders of magnitude
over prior global routing techniques. SGR outperforms simpler (non-global) routing techniques such as minimal adaptive routing by a 3.1X margin and non-minimal adaptive routing by a 37% margin.
Authors affiliation: Cairo University, Egypt
Purdue University*, USA
A. Abdel-Gawad and M. Thottethodi*
Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks
Abstract
The Cray Cascade architecture uses Dragonfly as its interconnect topology and employs a globally adaptive routing scheme called UGAL. UGAL directs traffic based on link loads but may make
inappropriate adaptive routing decisions in various situations, which degrades its performance. In this work, we propose to improve UGAL by incorporating a traffic pattern-based adaptation
mechanism for intra-group communication in Dragonfly. The idea is to explicitly use the link usage statistics that are collected in performance counters to infer the traffic pattern, and to
take the inferred traffic pattern plus link loads into consideration when making adaptive routing decisions. Our performance evaluation results on a diverse set of traffic conditions indicate
that by incorporating the traffic pattern-based adaptation mechanism, our scheme is more effective in making adaptive routing decisions and achieves lower latency under low load and higher
throughput under high load than the existing UGAL in many situations.
Authors affiliation: Florida State University, USA
Los Alamos National Lab*, USA
P. Faizian, M. S. Rahman, M. A. Mollah, X. Yuan, S. Pakin* and M. Lang*
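
As a rough illustration of combining an inferred traffic pattern with link loads when choosing between minimal and non-minimal paths, consider the sketch below. The classification rule, threshold, and path-length penalty are illustrative assumptions, not the authors' scheme.

```python
def infer_pattern(link_byte_counters):
    """Guess the intra-group traffic pattern from per-link counters.

    If traffic concentrates on one link, treat the pattern as adversarial
    (it benefits from non-minimal detours); otherwise treat it as uniform.
    """
    total = sum(link_byte_counters.values())
    if total == 0:
        return "uniform"
    hottest = max(link_byte_counters.values())
    return "adversarial" if hottest > 0.5 * total else "uniform"

def choose_route(q_min, q_nonmin, pattern):
    """UGAL-style choice between a minimal path and a roughly twice-as-long
    non-minimal path, biased by the inferred pattern rather than queue
    lengths alone."""
    # Under adversarial patterns, relax the penalty on the longer path.
    nonmin_penalty = 1.2 if pattern == "adversarial" else 2.0
    return "minimal" if q_min <= q_nonmin * nonmin_penalty else "non-minimal"

counters = {"link0": 900, "link1": 40, "link2": 60}
pattern = infer_pattern(counters)
print(pattern, choose_route(q_min=12, q_nonmin=7, pattern=pattern))
# adversarial non-minimal
```
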

12:00-13:30 | Lunch

13:30-15:00 | Switch Architecture and Traffic Management
Session Chair: Madeleine Glick

Improvements to the InfiniBand Congestion Control Mechanism
Abstract
The InfiniBand Congestion Control mechanism (IB CC) is able to reduce the negative consequences of congestion in many situations. However, its effectiveness depends on a
set of parameters that must be set by administrators. If the parameters are not appropriately configured, IB CC can negatively impact network performance. Additionally,
no one has been able to find a universal parameter setting that fits all situations. These difficulties prevent IB CC from being widely used. In this paper we propose several
enhancements to the existing IB CC. First, our improved IB CC greatly reduces the need for parameter configuration. Second, congestion is removed more quickly. Third, a new utilization-driven
approach and a new Link Bandwidth Availability Report (LBAR) approach are implemented to guide sending interfaces on how and when to adjust their injection rates. These adjustments are
aware of the actual network condition, rather than relying on preconfigured parameters as in the existing IB CC. Simulation results demonstrate that our improved IB CC is able to
reduce the consequences of congestion efficiently and can adapt to various network topologies and traffic patterns.
Authors affiliation: University of New Hampshire, USA
Simula Research Lab*, Norway
Q. Liu, R. D. Russell and E. G. Gran*
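
A very rough sketch of the general idea of rate adjustment driven by a bandwidth-availability report rather than by fixed, preconfigured parameters. The blending factor, bounds, and function name are invented for illustration; the paper's actual LBAR mechanism is not specified here.

```python
def adjust_injection_rate(current_rate, reported_available_bw, link_capacity):
    """Move the sender's injection rate toward the bandwidth the network
    reports as available, instead of reacting via fixed timer/threshold
    parameters. Constants below are illustrative only."""
    target = min(reported_available_bw, link_capacity)
    # Close half of the gap to the reported headroom on each update.
    new_rate = current_rate + 0.5 * (target - current_rate)
    return max(new_rate, 0.01 * link_capacity)  # never stall completely

print(adjust_injection_rate(current_rate=80.0, reported_available_bw=40.0,
                            link_capacity=100.0))  # -> 60.0
```
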
Scalable High-Radix Modular Crossbar Switches
Abstract
Crossbars are a basic building block of networks on chip that can be used as fast, single-stage networks or in router cores for larger scale networks. However, scaling crossbars to
high radices presents a number of efficiency, performance, and area challenges. Thus, we propose modular flow-through crossbar switch cores that perform better at high radices than
conventional monolithic designs. The modular sub-blocks are arranged in a controlled flow-through, pipelined scheme to eliminate global connections and maintain linear performance scaling
and high throughput. Modularity also enables energy savings via deactivation of unused I/O wires. Evaluation using an analytical crossbar switch modeling tool demonstrated an improved
energy-delay product (up to 5.3X) compared to conventional crossbar switches, but with approximately 30% area overhead. Further, we evaluated modular crossbar networks with the proposed switch cores
using BookSim2, a cycle-accurate network-on-chip simulation tool. The proposed design achieves more than 90% saturation capacity with an internal speedup of 1.5, supports data line rates as high
as 102.4 Gbps (in 40nm bulk CMOS), and offers lower average network latency compared to conventional crossbars.
Authors affiliation: CMU, USA
Altera Corp*, USA
Oracle Labs†, USA
C. Cakir, R. Ho*, J. Lexau† and K. Mai
A Clos-Network Switch Architecture based on Partially-Buffered Crossbar Fabrics
Abstract
Modern Data Center Networks (DCNs) that scale to thousands of servers require high-performance switches/routers to handle high traffic loads with minimum delays. Today's switches
need to be scalable, perform well and, more importantly, be cost-effective. This paper describes a novel three-stage Clos-network switching fabric with partially-buffered crossbar
modules and different scheduling algorithms. Compared to conventional fully-buffered and bufferless switches, the proposed architecture sits between the two designs and takes
the best of both: i) lower hardware requirements, which considerably reduce both the cost and the implementation complexity, and ii) a small number of internal buffers, which allows simple,
high-performance scheduling. Two alternative scheduling algorithms are presented. The first is scalable: it disperses the control function over multiple switching elements in the Clos network.
The second is simpler: it places some control on a central scheduler to ensure in-order packet delivery. Simulations for various switch settings and traffic profiles show that the
proposed architecture is scalable and maintains high throughput and low latency with less hardware.
Authors affiliation: University of Leeds, England
F. Hassen and L. Mhamdi

15:00-15:15 | Afternoon Break

15:15-16:45 | Panel
Moderator: Ron Brightwell

Many-core Reality Check — How increasing core counts, on-node networks, and deep integration will impact system interconnects
Abstract
Node architectures have entered an era of intense innovation, with trends toward increasing numbers of devices per processor; integration of memory and the introduction
of new memory technologies; and rapid increases in the scale of on-chip, on-package, and on-node networks. These trends, in turn, are influencing the design of network
architectures and redefining the interaction between nodes and the network. In this session, our panelists will critically evaluate these trends to identify key issues
that must be solved by the coming generations of high-speed networking technologies.
Mark Cummings, Orchestral Networks
Scot Schulz, Mellanox
Pavel Shamis, ARM
Keith Underwood, Intel

Thursday, August 25 (Symposium)

8:15-9:00 | Breakfast and Registration

09:00-09:30 | Invited Talk
Session Chair: Charlie Perkins

Building Large Scale Data Centers: Cloud Network Design Best Practices
Abstract
In this talk, we examine the network design principles of large-scale data centers. Public cloud application scale tends to exceed the capacity of any multi-processor machine,
making distributed applications the norm. These distributed applications are decomposed and deployed across multiple physical (or virtual) servers, which introduces network demands
for intra-application communication. Most large-scale data centers have a scalable network infrastructure, which happens to be a good fit for the distributed application model.
This deployment model is evolving to include parallel software clusters, microservices, and machine learning clusters. This evolution has ramifications on the corresponding network
attributes. We map network best practices down to salient switch and NIC ASIC devices at the architecture and feature level, and, time permitting, we discuss how these practices are
expressed in the public information major operators have published about their Data Center networks.
Ariel Hendel, Broadcom Limited (Invited Talk)
Bio
Ariel Hendel is a Broadcom Distinguished Engineer focusing on Data Center Networks and Switch Architecture. Ariel joined Broadcom in 2008; prior to Broadcom, he was a Distinguished
Engineer at Sun Microsystems, where he was twice a recipient of the Chairman's Award. He earned his BS degree from the Technion, Haifa, and his MS from Polytechnic University, New York.
Ariel holds more than 40 patents, with several more pending.

09:30-10:00 | Invited Talk
Session Chair: Charlie Perkins

Network topologies for large-scale compute centers: It's the diameter, stupid!
Abstract
We discuss the history and design tradeoffs for large-scale topologies in high-performance computing. We observe that datacenters are slowly following due to the growing demand for
low latency and high throughput at lowest cost. We then introduce a high-performance, cost-effective network topology called Slim Fly that approaches the theoretically optimal network
diameter. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in
latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers as well as a detailed
cost and power model. Slim Fly enables constructing cost-effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under different HPC
workloads such as stencil or graph computations.
Torsten Hoefler, ETH Zurich (Invited Talk)
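
As background (not part of the talk abstract), the phrase "theoretically optimal network diameter" can be made precise via the Moore bound: the maximum number of nodes reachable in a network of given switch radix and diameter. A topology whose size approaches this bound achieves close to the lowest possible diameter for its radix and node count.

```latex
% Moore bound: maximum number of vertices in a graph with
% maximum degree k and diameter d.
\[
  N_{\mathrm{Moore}}(k, d) \;=\; 1 + k \sum_{i=0}^{d-1} (k-1)^{i},
  \qquad \text{e.g. } N_{\mathrm{Moore}}(k, 2) = k^{2} + 1 .
\]
```
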
Bio
Torsten is an Assistant Professor of Computer Science at ETH Zurich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale
applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum where he chairs the "Collective Operations and
Topologies" working group. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference SC10, SC13, SC14, EuroMPI'13, HPDC'15, HPDC'16, IPDPS'15, and other conferences.
He published numerous peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He received the Latsis prize of ETH Zurich
as well as an ERC starting grant in 2015. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel
programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at htor.inf.ethz.ch.

10:00-10:30 | Morning Break

10:30-12:00 | Memory and Data Caching
Session Chair: Ricki Williams

Race Cars vs. Trailer Trucks: Switch Buffers Sizing vs. Latency Tradeoffs in Data Center Networks
Abstract
This paper raises the data center designer's question of the trade-off between high-buffer switches and low-latency switches. Packet buffer hardware dictates this trade-off due to
the constraints of DRAM and SRAM technologies. While designers who prefer robust network solutions would typically choose large-buffer switches and settle for high latency,
designers who can adapt applications to the network behavior would prefer low-latency switches in order to gain better application performance. In this paper, we revisit the
question of switch buffer sizing in data center networks by considering the switch delay in light of common traffic patterns in data centers. To the best of our knowledge, this is
the first paper that discusses the switch buffer sizing question by considering the switch latency trade-off. We review previous works on switch buffer sizing given the typical parameters
of data center networks, and survey the typical data center traffic patterns that challenge the switch buffer. We also provide simulation results that show the effect of switch latency
on the effective bandwidth of acknowledgement-based congestion-controlled flows. Finally, we discuss the gain that flow control provides to end-to-end network performance.
Authors affiliation: Mellanox, Israel
A. Shpiner and E. Zahavi
A Multilevel NOSQL Cache Design Combining In-NIC and In-Kernel Caches
Abstract
Since a large-scale in-memory data store, such as a key-value store (KVS), is an important software platform for data centers, this paper focuses on FPGA-based custom hardware
to further improve the efficiency of KVS. Although such FPGA-based KVS accelerators have been studied and have shown high performance per watt compared to software-based processing,
their cache capacity is strictly limited by the DRAMs implemented on FPGA boards, so their application domain is also limited. To address this issue, in this paper we propose
a multilevel NOSQL cache architecture that utilizes both an FPGA-based hardware cache and an in-kernel software cache in a complementary style. They are referred to as the L1 and L2 NOSQL
caches, respectively. The proposed multilevel NOSQL cache architecture motivates us to explore various design options, such as cache write and inclusion policies between the L1 and L2 NOSQL
caches. We implemented a prototype system of the proposed multilevel NOSQL cache using a NetFPGA-10G board and the Linux Netfilter framework. Based on the prototype implementation, we explore
the various design options for the multilevel NOSQL caches. Simulation results show that our multilevel NOSQL cache design reduces the cache miss ratio and improves throughput compared
to the non-hierarchical design.
Authors affiliation: Keio University, Japan
Y. Tokusashi and H. Matsutani
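
A rough sketch of the kind of two-level lookup path described above (an L1 in-NIC cache backed by an L2 in-kernel cache and a backing KVS). The class name, the write-through/inclusive policy, and the naive eviction are illustrative assumptions, not the paper's design.

```python
class MultilevelNosqlCache:
    """GET/SET path through L1 (in-NIC) -> L2 (in-kernel) -> backing KVS.

    Uses a write-through, inclusive policy purely for illustration; the paper
    explores several write and inclusion policies between L1 and L2.
    """

    def __init__(self, l1_capacity, l2_capacity, backing_store):
        self.l1, self.l1_capacity = {}, l1_capacity
        self.l2, self.l2_capacity = {}, l2_capacity
        self.backing = backing_store

    def _fill(self, cache, capacity, key, value):
        if len(cache) >= capacity and key not in cache:
            cache.pop(next(iter(cache)))  # naive eviction stand-in
        cache[key] = value

    def get(self, key):
        if key in self.l1:                      # L1 hit: served in the NIC
            return self.l1[key]
        if key in self.l2:                      # L2 hit: served in the kernel
            self._fill(self.l1, self.l1_capacity, key, self.l2[key])
            return self.l2[key]
        value = self.backing.get(key)           # miss: fall through to the KVS
        if value is not None:
            self._fill(self.l2, self.l2_capacity, key, value)
            self._fill(self.l1, self.l1_capacity, key, value)
        return value

    def set(self, key, value):
        self.backing[key] = value               # write-through to the KVS
        self._fill(self.l2, self.l2_capacity, key, value)
        self._fill(self.l1, self.l1_capacity, key, value)

store = {"user:1": "alice"}
cache = MultilevelNosqlCache(l1_capacity=2, l2_capacity=8, backing_store=store)
print(cache.get("user:1"))  # miss: filled into L2 and L1
print(cache.get("user:1"))  # now an L1 hit
```
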
RoB-Router: Low Latency Network-on-Chip Router Microarchitecture Using Reorder Buffer
Abstract
Switch allocation is the critical pipeline stage for networks-on-chip (NoCs), and it is influenced by the order of packets in input buffers. Traditional input-queued routers in NoCs only
have a small number of virtual channels (VCs), and the packets in a VC are kept in fixed order. Such a design is susceptible to head-of-line (HoL) blocking, as only the packet at the head
of a VC can be allocated by the switch allocator. HoL blocking significantly degrades the efficiency of switch allocation as well as the performance of NoCs. In this paper, we propose to utilize
reorder buffer (RoB) techniques to mitigate HoL blocking, accelerate switch allocation, and thus reduce the latency of NoCs. We propose to design VCs as RoBs to allow packets that are not at the
head of a VC to be allocated before the head packet. RoBs reduce the conflicts in switch allocation and can effectively increase the number of matches in switch allocation. We design RoB-Router based on
traditional input-queued routers in a lightweight fashion, considering the trade-off between performance and cost. Our design can be extended to most state-of-the-art input-queued routers. Evaluation
results show that RoB-Router achieves 46% and 15.7% improvements in packet latency under synthetic traffic and PARSEC traces, respectively, compared to TS-Router, the currently most efficient switch
allocator, and its energy and area cost is moderate.
Authors affiliation: National University of Defense Technology, China
C. Li, D. Dong, X. Liao, J. Wu and F. Lei
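
The core idea, letting a packet behind a blocked head-of-line packet win switch allocation, can be illustrated with a small sketch. The data structures and the first-fit selection policy below are assumptions for illustration, not the RoB-Router microarchitecture.

```python
def allocate_switch_outputs(virtual_channels, free_outputs):
    """Pick one packet per VC to send this cycle, skipping blocked heads.

    A FIFO VC could only consider index 0 of each queue; modelling the VC as a
    reorder buffer lets a younger packet whose output port is free be granted
    while the head packet waits, mitigating head-of-line blocking.
    """
    grants = []
    available = set(free_outputs)
    for vc_id, packets in virtual_channels.items():
        for idx, (pkt, out_port) in enumerate(packets):
            if out_port in available:            # first packet whose output is free
                grants.append((vc_id, idx, pkt, out_port))
                available.remove(out_port)
                break                            # at most one grant per VC per cycle
    return grants

# VC 0's head wants the busy output 3, but the packet behind it can still go.
vcs = {0: [("pktA", 3), ("pktB", 1)], 1: [("pktC", 2)]}
print(allocate_switch_outputs(vcs, free_outputs={1, 2}))
# [(0, 1, 'pktB', 1), (1, 0, 'pktC', 2)]
```
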

12:00-13:30 | Lunch

13:30-14:30 | Keynote
Session Chair: Ricki Williams

Software-Defined Everything and the New Role of Interconnects
Roy Chua, SDxCentral

14:30-14:45 | Afternoon Break

14:45-16:15 | Node and Network Architectures
Session Chair: Songkrant Muneenaem

Offloading Collective Operations to Programmable Logic on a Zynq Cluster
Abstract
This paper describes our architecture and implementation for offloading collective operations to programmable logic in the communication substrate. Collective operations –
operations that involve communication between groups of cooperating processes – are widely used in parallel processing. The design and implementation strategies of collective operations
play a significant role in their performance and thus affect the performance of many high-performance computing applications that utilize them. Collectives are central to the widely used Message
Passing Interface (MPI) programming model. The programmable logic provided by FPGAs is a powerful option for creating task-specific logic to aid applications. While our work is evaluated on the
Xilinx Zynq SoC, it is generally applicable in scenarios where there is programmable logic in the communication pipeline, including FPGAs on network interface cards like the NetFPGA or new systems
like Intel’s Xeon with on-die Altera FPGA resources. In this paper we have adapted and generalized our previous work in offloading collective operations to the NetFPGA. Here we present a general
collective offloading framework for use in applications using the Message Passing Interface (MPI). The implementation is realized on the Xilinx Zynq reference platform, the Zedboard, using an Ethernet
daughter card called EthernetFMC. Results from microbenchmarks are presented as well as from some scientific applications using MPI.
Authors affiliation: Indiana University, USA
O. Arap and M. Swany
Exploring Data Vortex Network Architectures
Abstract
In this work, we present an overview of the Data Vortex interconnection network, a network designed for both traditional HPC and emerging irregular and data analytics workloads.
The Data Vortex network consists of a congestion-free, high-radix network switch and a Vortex Interconnection Controller (VIC) that interfaces the compute node with the rest of the network.
The Data Vortex network is designed to transfer fine-grained network packets at a high injection rate, without congesting the network or negatively impacting performance. Our results show
that the Data Vortex network is more efficient than traditional HPC networks for fine-grained data transfers. Moreover, our experiments show that a Data Vortex system achieves higher scalability
even when using global synchronization primitives.
Authors affiliation: Pacific Northwest National Lab, USA
CMU*, USA
R. Gioiosa, T. Warfel*, J. Yin, A. Tumeo and D. Haglin
Exploring Wireless Technology for Off-Chip Memory Access
Abstract
The trend of shifting from multi-core to many-core processors is exceeding the data-carrying capacity of the traditional on-chip communication fabric. While the importance of the on-chip communication
paradigm cannot be denied, the off-chip memory access latency is fast becoming an important challenge. As more memory intensive applications are developed, off-chip memory access will limit the performance
of chip multi-core processors (CMPs). However, with the shrinkage of transistor dimension, the energy consumption and the latency of the traditional metallic interconnects are increasing due to smaller wire
widths, longer wire lengths, and complex multi-hop routing requirements. In contrast, emerging wireless technology requires lower energy with single-hop communication, albeit with limited bandwidth (at a 60
GHz center frequency). In this paper, we have proposed several hybrid-wireless architectures to access off-chip memory by exploiting frequency division multiplexing (FDM), time division multiplexing (TDM),
and space division multiplexing (SDM) techniques. We explore the design-space of building hybrid-wireless interconnects by considering conservative and aggressive wireless bandwidths and directionality. Our
hybrid-wireless architectures require a maximum of two hops and show a 10.91% reduction in execution time compared to a baseline metallic architecture. In addition, the proposed hybrid-wireless
architectures show, on average, 62.07% and 32.52% energy-per-byte improvements over traditional metallic interconnects for conservative and aggressive off-chip metallic link energy efficiency,
respectively. Nevertheless, the proposed hybrid-wireless architectures incur an area overhead due to the higher transceiver area requirement.
Authors affiliation: Ohio University, USA
M. A. Sikder, A. Kodi, S. Kaya, W. Rayess, D. Matolak and D. Ditomaso

16:15-16:30 | Awards & Closing Remarks
Ryan Grant & Charlie Perkins, General Chairs

IMPORTANT DATES (Tutorials)
Materials due: August 10