|
Time |
Wednesday, August 22 (Symposium) |
8:00-8:45 |
Breakfast and Registration |
8:45-9:00 |
Intro
Torsten Hoefler, Technical Program Chair
Patrick Geoffray and Hamid Ahmadi, General Chairs
|
9:00-10:00 |
Keynote I |
Session Chair: Patrick Geoffray
|
The Future Of Network Technology - What is Old, is New Again
John Roese, Huawei
|
10:00-10:30 |
Morning Break |
10:30-12:00 |
Network Acceleration |
Session Chair: Ada Gavrilovska
|
ParaSplit: A Scalable Architecture on FPGA for Terabit Packet Classification
Abstract
Packet classification is a fundamental enabling function for various applications in switches, routers and firewalls. Due to their
performance and scalability limitations, current packet classification solutions are insufficient in addressing the challenges from
the growing network bandwidth and the increasing number of new applications. This paper presents a scalable parallel architecture,
named ParaSplit, for high-performance packet classification. We propose a rule set partitioning algorithm based on range-point conversion
to reduce the overall memory requirement. We further optimize the partitioning by applying the Simulated Annealing technique. We implement
the architecture on a Field Programmable Gate Array (FPGA) to achieve high throughput by exploiting the abundant parallelism in the hardware.
Evaluation using real-life data sets shows that ParaSplit achieves significant reduction in memory requirement, compared with state-of-the-art
algorithms such as HyperSplit and EffiCuts. Because of the memory efficiency of ParaSplit, our FPGA design can support multiple engines in the
on-chip memory, each of which contains up to 10K complex rules. As a result, the architecture with multiple ParaSplit engines in parallel
can achieve up to Terabit-per-second throughput for large and complex rule sets on a single FPGA device.
J. Fong, X. Wang, Y. Qi, J. Li and W. Jiang
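The partitioning step lends itself to a small illustration. The sketch below is not the authors' ParaSplit algorithm (which also relies on range-point conversion); it is a generic simulated-annealing partitioner over made-up 1-D rules with a hypothetical memory-cost proxy, intended only to show how annealing can trade partition assignments against an estimated memory footprint.

```python
# Illustrative sketch only (assumptions: toy 1-D rules, invented cost proxy,
# arbitrary cooling schedule). It is NOT the paper's ParaSplit algorithm.
import math
import random

def group_cost(group):
    """Toy memory proxy: overlapping rules inflate decision-tree memory,
    so charge each group for its pairwise range overlaps."""
    overlaps = 0
    for i, (lo1, hi1) in enumerate(group):
        for lo2, hi2 in group[i + 1:]:
            if lo1 <= hi2 and lo2 <= hi1:      # the two ranges overlap
                overlaps += 1
    return len(group) + overlaps * overlaps

def anneal_partition(rules, k=4, steps=5000, t0=10.0, alpha=0.999):
    """Assign each rule to one of k groups, minimizing the summed proxy cost."""
    assign = [random.randrange(k) for _ in rules]
    groups = [[] for _ in range(k)]
    for rule, g in zip(rules, assign):
        groups[g].append(rule)
    cost, t = sum(group_cost(g) for g in groups), t0
    for _ in range(steps):
        i, dst = random.randrange(len(rules)), random.randrange(k)
        src = assign[i]
        if dst == src:
            continue
        groups[src].remove(rules[i]); groups[dst].append(rules[i])
        new_cost = sum(group_cost(g) for g in groups)
        # Accept all improving moves; accept worsening ones with prob. exp(-delta/t).
        if new_cost <= cost or random.random() < math.exp((cost - new_cost) / t):
            assign[i], cost = dst, new_cost
        else:
            groups[dst].remove(rules[i]); groups[src].append(rules[i])  # undo
        t *= alpha
    return groups

# Example: random port ranges standing in for multi-field classifier rules.
rules = [tuple(sorted(random.sample(range(65536), 2))) for _ in range(64)]
print([len(g) for g in anneal_partition(rules)])
```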
A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)
Abstract
Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters.
The high and unpredictable latency of these systems has led the trading world to explore alternative "hybrid" architectures with
hardware acceleration. In this paper, we describe how FPGAs are being used in electronic trading to approach the goal of zero latency.
We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides
pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example
financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10 Gb/s Ethernet line rate with a fixed
end-to-end latency of 1 μs - up to two orders of magnitude lower than comparable software implementations.
J. Lockwood, A. Gupte, N. Mehta, M. Blott, T. English and K. Vissers
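As a flavor of what the library's financial protocol parsers must do, here is a small host-side sketch for a made-up fixed-width binary order message; the field layout, names, and price scaling are assumptions for illustration only, not a real exchange protocol and not the authors' FPGA implementation.

```python
# Hedged illustration: a parser for a hypothetical fixed-width binary
# "add order" message, showing the kind of field extraction the FPGA
# library performs at line rate. The wire format below is invented.
import struct

# assumed layout: type(1 byte) seq(u32) symbol(8 bytes) price(u32, 1e-4 USD) qty(u32), big-endian
ADD_ORDER = struct.Struct(">cI8sII")

def parse_add_order(frame: bytes) -> dict:
    mtype, seq, symbol, price, qty = ADD_ORDER.unpack(frame[:ADD_ORDER.size])
    return {
        "type": mtype.decode(),
        "seq": seq,
        "symbol": symbol.rstrip(b" ").decode(),
        "price": price / 10_000,      # fixed-point to USD
        "qty": qty,
    }

msg = ADD_ORDER.pack(b"A", 42, b"XYZ     ", 1234500, 100)
print(parse_add_order(msg))   # {'type': 'A', 'seq': 42, 'symbol': 'XYZ', 'price': 123.45, 'qty': 100}
```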
Rx Stack Accelerator for 10 GbE Integrated NIC
Abstract
The miniaturization of CMOS technology has reached a scale at which server processors are starting to integrate multi-gigabit network interface
controllers. While transistors are becoming cheap and abundant in solid-state circuits, they remain a scarce resource on a processor die where ever
more cores and caches must share a fixed amount of silicon area and power. Therefore, a successful design candidate for integration must provide high
networking performance under high logic density and low power dissipation.
This paper describes the design of an integrated accelerator to offload computation-intensive protocol-processing tasks. The accelerator combines the
concepts of the transport-triggered architecture with a programmable finite-state machine to deliver high instruction-level parallelism, efficient multiway
branching and flexibility. The flexibility is key to adapt to protocol changes and address new applications.
This receive stack accelerator was used in the construction of an integrated quad-port 10 GbE host Ethernet adapter in 45-nm CMOS technology. The ratio of
performance (15 Mfps, 20 Gb/s throughput per port) to area (0.7 mm2) and the power consumption (0.15 W) of this accelerator are core enablers for integrating a
network adapter and a processor compute complex.
F. Abel, F. Verplanken, C. Hagleitner
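The programmable-FSM idea can be illustrated with a tiny software analogue: protocol parsing driven by a transition table that can be rewritten without touching the engine. The table and extractor layout below are assumptions made for illustration; only the Ethernet/IPv4 field offsets and constants are standard.

```python
# Minimal sketch of a table-driven protocol-parsing FSM (not the paper's
# transport-triggered design): reprogramming the tables changes which
# protocols are recognized without changing the engine itself.
from typing import Callable, Dict, Tuple

# (state, observed field value) -> next state
TABLE: Dict[Tuple[str, int], str] = {
    ("eth", 0x0800): "ipv4",     # EtherType IPv4
    ("ipv4", 6):      "tcp",     # IP protocol TCP
    ("ipv4", 17):     "udp",     # IP protocol UDP
}

# Each state knows which header field it inspects.
EXTRACT: Dict[str, Callable[[bytes], int]] = {
    "eth":  lambda f: int.from_bytes(f[12:14], "big"),  # EtherType
    "ipv4": lambda f: f[14 + 9],                        # IP protocol byte
}

def classify(frame: bytes) -> str:
    state = "eth"
    while state in EXTRACT:
        key = EXTRACT[state](frame)
        state = TABLE.get((state, key), "other")
    return state   # "tcp", "udp", or "other"

frame = bytes(12) + b"\x08\x00" + bytes(9) + b"\x06" + bytes(40)
print(classify(frame))   # "tcp"
```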
|
12:00-13:00 |
Lunch |
13:00-14:00 |
Traffic Generation and Scheduling |
Session Chair: Christos Kolias
|
Caliper: Precise and Responsive Traffic Generator
Abstract
This paper presents Caliper, a highly accurate packet injection tool that generates precise and responsive traffic. Caliper takes live packets
generated on a host computer and transmits them onto a gigabit Ethernet network with precise inter-transmission times. Existing software traffic
generators rely on generic Network Interface Cards which, as we demonstrate, do not provide high-precision timing guarantees. Hence, performing
valid and convincing experiments becomes difficult or impossible in the context of time-sensitive network experiments. Our evaluations show that
Caliper is able to reproduce packet inter-transmission times from a given arbitrary distribution while capturing the closed-loop feedback of TCP
sources. Specifically, we demonstrate that Caliper provides three orders of magnitude better precision compared to a commodity NIC: with requested
traffic rates up to the line rate, Caliper incurs an error of 8 ns or less in packet transmission times. Furthermore, we explore Caliper's ability
to integrate with existing network simulators to project simulated traffic characteristics into a real network environment. Caliper is freely
available online.
M. Ghobadi, G. Salmon, Y. Ganjali, M. Labrecque and J. G. Steffan
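To make "reproducing inter-transmission times from a given arbitrary distribution" concrete, the sketch below (an assumption for illustration, not Caliper's code) draws per-packet gaps from a user-supplied distribution and turns them into absolute transmit deadlines; the 8 ns figure above bounds how far the hardware's actual transmit times deviate from such deadlines.

```python
# Hypothetical sketch of the workload a precise generator must reproduce;
# the gap distribution and rate are made up for illustration.
import random

def transmit_deadlines(n, gap_dist):
    """gap_dist() returns one requested inter-packet gap in seconds."""
    t, deadlines = 0.0, []
    for _ in range(n):
        t += gap_dist()
        deadlines.append(t)
    return deadlines

# Example: Poisson arrivals at 100 kpps (exponentially distributed gaps, mean 10 us).
deadlines = transmit_deadlines(10_000, lambda: random.expovariate(100_000))

# A precise generator must hit each deadline; the per-packet error
# (actual transmit time minus deadline) is what Caliper bounds at 8 ns.
print(deadlines[:3])
```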
Weighted Differential Scheduler
Abstract
The Weighted Differential Scheduler (WDS) is a new scheduling discipline for accessing shared resources. The work described here was
motivated by the need for a simple weighted scheduler for a network switch where multiple packet flows are competing for an output port.
The scheme can be implemented with simple arithmetic logic and finite state machines.
We describe several versions of WDS that can merge two or more flows. An analysis reveals that WDS has lower jitter than any other
weighted scheduler known to us.
H. Eberle and W. Olesinski
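The abstract does not spell out the WDS algorithm itself, so the sketch below is a generic weighted two-flow merge built around a single signed counter; it only illustrates the claim that weighted scheduling can be implemented with simple arithmetic and a small state machine, and is not the authors' scheduler.

```python
# Generic weighted merge (an assumption, not the paper's WDS), kept to
# one add/subtract per decision to mirror "simple arithmetic logic".
def weighted_merge(flow_a, flow_b, w_a, w_b):
    """Interleave two packet lists roughly in the ratio w_a : w_b."""
    a, b = list(flow_a), list(flow_b)
    credit, out = 0, []          # credit >= 0: flow A is next in line
    while a or b:
        serve_a = (credit >= 0 and a) or not b
        if serve_a:
            out.append(a.pop(0)); credit -= w_b
        else:
            out.append(b.pop(0)); credit += w_a
    return out

# 2:1 split between flows A and B.
print("".join(weighted_merge("A" * 8, "B" * 4, 2, 1)))   # ABAABAABAABA
```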
|
14:00-14:30 |
Afternoon Break |
14:30-15:30 |
Keynote II |
Session Chair: Torsten Hoefler
|
Cray High Speed Networking
Bob Alverson, Cray Inc.
|
15:30-16:00 |
Invited Talk |
Session Chair: Fredy Neeser
|
How to Compare Alternative Architectures
Abstract
There are various aspects of network infrastructure that are orthogonal and can therefore be compared conceptually: for example, the syntax of
encapsulation, how forwarding tables are calculated, and whether forwarding tables are filled in proactively or on demand when a new flow starts.
This talk will explain these concepts and show how various proposed architectures (such as TRILL, VXLAN, and OpenFlow) compare.
Radia Perlman, Intel (invited talk)
|
16:10-17:00 |
Cocktail Reception |
17:00-18:30 |
Evening Panel |
Moderator: Fabrizio Petrini
|
The Network is Moving into the Sockets
Abstract
Intel's recent acquisitions of Fulcrum Microsystems, QLogic's InfiniBand assets, and Cray's
networking division, combined with AMD's acquisition of SeaMicro, show
that the heat is on in the networking world.
There is a clear trend to move the network interface into the socket, with
performance, scalability, and power-reduction wins when the network sits
close to the processing engine.
This has dramatic implications in the data-center and high-performance
networking world. In this panel we will discuss how this trend could
change the future of upcoming data centers and network architectures.
Lloyd Dickman, Bay Storage Technology
Christian Bell, Myricom
Gilad Shainer, Mellanox
Moray McLaren, HP
Greg Thorson, SGI
Keith Underwood, Intel
|
Time |
Thursday, August 23 (Symposium) |
8:00-9:00 |
Breakfast and Registration |
9:00-10:00 |
Keynote III |
Session Chair: Fabrizio Petrini
|
Power-Efficient, High-Bandwidth Optical Interconnects for High Performance Computing
Fuad Doany, IBM T. J. Watson
|
10:10-10:30 |
Morning Break |
10:30-12:00 |
Performance Evaluation |
Session Chair: Torsten Hoefler
|
Portals 4: Enabling Application/Architecture Co-Design for High-Performance Interconnects
Abstract
The Portals project has entered a third decade of research and
development in scalable, high-performance networking for large-scale
scientific parallel computing systems. Portals has evolved from its
inception as a component of early lightweight operating systems to
become an important vehicle for interconnect exploration. Unlike most
user-level network programming interfaces, Portals employs a building
block approach that encapsulates the semantic requirements of a broad
range of upper-level protocols needed to support high-performance
computing applications and services. This approach has also enabled
hardware designers to focus on developing components that accelerate
key functions in Portals, facilitating the application/architecture
co-design process. I will provide an overview of the latest version of
the Portals interconnect API and describe research activities aimed at
exploiting some recently added capabilities.
Ron Brightwell, Sandia National Laboratories (invited talk)
Performance Evaluation of Open MPI on Cray XE/XK Systems
Abstract
Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of
Open MPI, however, lack support for the Cray XE6 and XK6 architectures, both of which use the Gemini System Interconnect. In this paper, we present
extensions to natively support these architectures within Open MPI; describe and propose solutions for performance and scalability bottlenecks; and provide
an extensive evaluation of our implementation, which is the first open-source MPI implementation for the Cray XE/XK system families to be evaluated at 49,152 processes.
Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to the vendor-supplied MPI's.
Micro-benchmark results show 1-byte and 1024-byte message latencies of 1.20 usec and 4.13 usec, which are 10.00% and 39.71% better than the vendor-supplied
MPI's, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, which is similar to the vendor-supplied MPI's bandwidth at the same message size.
Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales up to 49,152 cores where we exhibited similar
performance and scaling characteristics when compared to the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI,
which is on par with the vendor-supplied MPI's achieved parallel efficiency.
S. Gutierrez, M. G. Venkata, N. Hjelm, R. Graham
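For context, small-message MPI latencies like those above are typically obtained with a ping-pong micro-benchmark. The sketch below is one such harness written against mpi4py; it is an illustrative assumption, not the authors' benchmark or measurement methodology.

```python
# Standard ping-pong latency micro-benchmark (illustrative; not the paper's
# harness). Requires mpi4py and NumPy; run with: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS, SKIP = 10000, 100          # warm-up iterations excluded from timing

for size in (1, 1024):
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    for i in range(ITERS + SKIP):
        if i == SKIP:
            t0 = MPI.Wtime()
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    if rank == 0:
        # one-way latency = half the average round-trip time
        lat = (MPI.Wtime() - t0) / ITERS / 2
        print(f"{size} B: {lat * 1e6:.2f} usec")
```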
Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems
Abstract
Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever increasing
communication demands being placed on them by HPC applications and cloud computing middlewares (e.g., Hadoop). The PCIe interfaces can now deliver
speeds up to 128 Gbps (Gen3), and high performance interconnects (10/40 GigE, InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE RDMA over
Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user
in the HPC / cloud computing domain can expect by utilizing newer generations of these interconnects over older ones, or how one type of
interconnect (such as IB) performs in comparison to another (such as RoCE).
In this paper, we evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads.
Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of
HPC applications and cloud computing middlewares. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best
performance for HPC as well as cloud computing applications.
J. Vienne , J. Chen, Md. Wasi-ur-Rahman, N. Islam, H. Subramoni, D. K. Panda
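The quoted 32 Gbps (QDR) and 54 Gbps (FDR) data rates follow from the standard InfiniBand lane rates and line encodings; a quick sanity check is below. The lane-rate and encoding constants are public InfiniBand parameters, not results from the paper.

```python
# Back-of-the-envelope check of the link speeds quoted above, under the
# standard encoding assumptions (8b/10b for QDR, 64b/66b for FDR).
qdr = 4 * 10.0    * 8 / 10    # 4 lanes x 10 Gb/s signaling, 8b/10b   -> 32.0 Gb/s data
fdr = 4 * 14.0625 * 64 / 66   # 4 lanes x 14.0625 Gb/s signaling, 64b/66b -> ~54.5 Gb/s data
print(f"QDR 4x: {qdr:.1f} Gb/s, FDR 4x: {fdr:.1f} Gb/s")   # matches the 32 / 54 Gbps figures
```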
|
12:00-13:00 |
Lunch |
13:00-14:30 |
Routing and Switching |
Session Chair: John Lockwood
|
Electronic-Photonic Integration within Switches and Routers
Abstract
We review recent successes in silicon photonics and how the new
capabilities afforded by silicon photonics will impact future Ethernet,
Infiniband, and ultimately optical domain switches and routers.
Specifically, we consider the impact silicon photonics can have on the
cost, bandwidth, radix, and power consumption scaling of future switches
and routers.
Mike Watts, MIT (invited talk)
Bufferless Routing in Optical Gaussian Macrochip Interconnect
Abstract
In this paper, we study bufferless routing in a novel optical multichip system, called Gaussian macrochip, where embedded chips are interconnected
by an optical Gaussian network. By taking advantage of the underlying Hamiltonian cycles in the Gaussian network, we design a bufferless routing algorithm
for the Gaussian macrochip, which routes packets along the shortest path in the absence of deflection, and guarantees that deflected packets reach their
destinations along a segment of the Hamiltonian cycle. Our extensive simulation results demonstrate that by adopting the proposed routing algorithm,
the Gaussian macrochip can support much higher inter-chip communication bandwidth, has much shorter average packet delay, and is more power efficient than
the previously proposed architectures for optical multichip systems.
Z. Zhang, Z. Guo and Y. Yang
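As a toy illustration of the bufferless deflection idea (an assumption, not the paper's algorithm or topology), the sketch below routes on a simple ring standing in for the embedded Hamiltonian cycle: a packet takes the shortest direction if that output is free in the current cycle, and is otherwise deflected one hop along the cycle instead of being buffered.

```python
# Toy bufferless deflection on an n-node ring (assumption: the ring is a
# stand-in for the Hamiltonian cycle embedded in the Gaussian network).
def next_hop(cur, dst, n, busy):
    """busy(u, v) -> True if link u->v is already claimed this cycle."""
    cw, ccw = (cur + 1) % n, (cur - 1) % n
    # Prefer the shorter way around the cycle.
    preferred = cw if (dst - cur) % n <= n // 2 else ccw
    if not busy(cur, preferred):
        return preferred
    return cw            # deflect along the fixed cycle direction, never buffer

free = lambda u, v: False
print(next_hop(0, 3, 8, free))   # 1: clockwise is shorter
print(next_hop(0, 6, 8, free))   # 7: counter-clockwise is shorter
```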
Occupancy Sampling for Terabit CEE Switches
Abstract
One consequential feature of Converged Enhanced Ethernet (CEE) is losslessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN).
We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch; however, as future switches
scale to higher port counts and link speeds, purely output-queued or shared memory architectures lead to excessive memory bandwidth requirements; moreover, PFC typically requires
dedicated buffers per input.
Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's
side effects. We install QCN congestion points (CPs) at input buffers (e.g., VOQs) and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims.
Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method not requiring any per-flow state.
For CPs with arbitrarily scheduled buffers, QCN-OS is shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.
F. Neeser, N. Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, K. Valk, C. Basso
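The difference between arrival-based marking and occupancy sampling can be shown in a few lines (an illustrative assumption, not the paper's QCN-OS mechanism): sampling a packet already resident in the buffer selects a flow with probability proportional to its share of the occupancy, so a culprit holding most of a VOQ is preferentially marked even when victims arrive just as often.

```python
# Illustrative-only contrast between arrival-based marking and
# occupancy sampling; flow mix and numbers are made up.
import random
from collections import Counter

buffer_pkts = ["culprit"] * 90 + ["victim"] * 10     # culprit holds 90% of the VOQ

def mark_on_arrival(arriving_flow):
    return arriving_flow                   # marks whichever flow shows up at the CP

def mark_by_occupancy(buffer_pkts):
    return random.choice(buffer_pkts)      # marks in proportion to buffer occupancy

arrivals = ["victim"] * 50 + ["culprit"] * 50         # both flows arrive equally often
print(Counter(mark_on_arrival(f) for f in arrivals))                 # ~50/50: victims punished
print(Counter(mark_by_occupancy(buffer_pkts) for _ in range(100)))   # ~90/10: culprit marked
```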
|
14:30-15:00 |
Afternoon Break |
15:00-16:00 |
Keynote IV |
Session Chair: Dan Pitt
|
Software Defined Networks will tame complex networks
Nick McKeown, Stanford University
|
16:00-17:00 |
Industry Panel |
Moderator: Matt Palmer
|
Abstract
Two decades of 'closed' switch and router designs have led to networking R&D 'ossification', possibly constraining academic and startup innovation. The advent of SDN,
virtual overlays and OpenFlow in the context of datacenter networking (DCN) now challenges the status quo with new designs and players.
Once every generation, conflicting forces churn a field, triggering opportunities for innovation and readjusting the balance of power, e.g.: Is SDN and/or OpenFlow becoming
the Linux of networking? Is SDN equivalent to OpenFlow? Should their APIs be standardized, or should the market decide? How can we build 1M-node DCNs with centralized control?
What's the impact of SDN and OpenFlow on highly optimized PoDs vs. the generic vendor fabric? How about
expectation management?
Dave Meyer, Cisco
Kireeti Kompella, Juniper
Jeff Mogul, HP
Vijoy Pandey, IBM
Dimitri Stiliadis, Lucent
|