|
Time |
Wednesday, August 21 (Symposium) |
8:00-8:45 |
Breakfast and Registration |
8:45-9:00 |
Intro
Cyriel Minkenberg and Sudipta Sengupta, Technical Program Chairs
Madeleine Glick, Torsten Hoefler, and Fabrizio Petrini, General Chairs
|
9:00-9:10 |
Welcome Address
Claudio DeSanti, Cisco Fellow
|
9:10-10:10 |
Keynote I |
Session Chair: Cyriel Minkenberg
|
Scale and Programmability in Google's Software Defined Data Center WAN
Amin Vahdat, UCSD/Google
|
10:10-10:30 |
Morning Break |
10:30-12:00 |
On-Chip Communications |
Session Chair: Ada Gavrilovska
|
Deterministic Multiplexing of NoC on Grid CMPs
Abstract
As the number of cores in a chip has increased over the past several years, inter-core communication has become a bottleneck. Traditional bus architectures cannot
handle the traffic load for the increasingly large number of communicating units on chip. Using a nanophotonic Network-on-Chip (NoC) is a proposed solution, motivated by
recent research indicating higher throughput than electronic NoCs. In order to avoid optical/electronic/optical conversion at intermediate routers, all-to-all connectivity
can be achieved through Time-Division Multiplexing (TDM). Previous work has focused on using a non-deterministic approach to determine the multiplexing schedule in optically
connected mesh NoCs. Such an approach, however, produces an irregular schedule that is not scalable, especially if TDM is to be combined with wavelength-division multiplexing
(WDM) and space-division multiplexing to reduce the communication delay. In this work, we present a regular multiplexing schedule for all-to-all connectivity which is at least
as efficient as the previously introduced irregular schedule. Moreover, because of its regularity and its systematic construction, our schedule is scalable to arbitrary-size
meshes and allows for efficient combination of TDM, WDM and space-division multiplexing (the use of multiple NoCs).
J. Carpenter and R. Melhem
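For readers new to TDM-based all-to-all connectivity, the sketch below shows one minimal, regular schedule for N nodes: in slot t, node i sends to node (i + t) mod N, so every ordered pair communicates once every N - 1 slots. It is purely illustrative and is not the paper's construction, which must also account for the mesh geometry and its combination with WDM and space-division multiplexing.

```python
# Illustrative sketch (not the paper's schedule): a regular, deterministic
# TDM schedule for all-to-all connectivity among N nodes. In time slot t,
# node i transmits to node (i + t) mod N, so every ordered pair communicates
# exactly once every N - 1 slots, for any N.

def tdm_schedule(num_nodes):
    """Return a list of slots; each slot maps sender -> receiver."""
    schedule = []
    for t in range(1, num_nodes):          # slot 0 would be self-traffic, so skip it
        slot = {i: (i + t) % num_nodes for i in range(num_nodes)}
        schedule.append(slot)
    return schedule

if __name__ == "__main__":
    for t, slot in enumerate(tdm_schedule(4), start=1):
        print(f"slot {t}: {slot}")
```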
Minimizing Delay in Shared Pipelines
Abstract
Pipelines are widely used to increase throughput in multi-core chips by parallelizing packet processing. Typically, each packet type is serviced by a dedicated
pipeline. However, with the increase in the number of packet types and in the number of services they require, there are not enough cores to dedicate a pipeline to each type. In this paper,
we study pipeline sharing, such that a single pipeline can be used to serve several packet types. Pipeline sharing decreases the needed total number of cores, but
typically increases pipeline lengths and therefore packet delays. We consider the optimization problem of allocating cores between different packet types such that
the average delay is minimized. We suggest a polynomial-time algorithm that finds the optimal solution when the packet types preserve a specific property. We also present
a greedy algorithm for the general case. Last, we examine our solutions on synthetic examples, on packet-processing applications, and on real-life H.264
standard requirements.
O. Rottenstreich, I. Keslassy, Y. Revah and A. Kadosh
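As a rough illustration of the trade-off the paper studies (cores saved by sharing versus added delay), here is a toy greedy merge heuristic under a simplified model in which a shared pipeline implements the union of its members' services. It is not the authors' algorithm, and the service sets and core budget are invented.

```python
# Toy greedy heuristic illustrating pipeline sharing (not the paper's algorithm).
# Each packet type needs a set of services; a shared pipeline implements the
# union of its members' services, so sharing saves cores when services overlap,
# but every member then traverses the longer, shared pipeline. We greedily merge
# pipelines until a core budget is met, picking the merge that hurts the
# rate-weighted average delay the least.

def greedy_share(types, core_budget):
    """types: list of (rate, set_of_services). Returns pipelines as lists of type indices."""
    pipelines = [[i] for i in range(len(types))]
    total_rate = sum(r for r, _ in types)

    def length(p):
        return len(set().union(*(types[i][1] for i in p)))

    def cores(ps):
        return sum(length(p) for p in ps)

    def weighted_delay(ps):
        return sum(types[i][0] * length(p) for p in ps for i in p) / total_rate

    while cores(pipelines) > core_budget and len(pipelines) > 1:
        best = None
        for a in range(len(pipelines)):
            for b in range(a + 1, len(pipelines)):
                merged = [p for k, p in enumerate(pipelines) if k not in (a, b)]
                merged.append(pipelines[a] + pipelines[b])
                cand = (weighted_delay(merged), merged)
                if best is None or cand[0] < best[0]:
                    best = cand
        pipelines = best[1]
    return pipelines

# Example: three packet types with overlapping service needs and 5 available cores.
print(greedy_share([(5, {"parse", "acl"}), (1, {"parse", "nat"}), (2, {"acl", "qos"})], 5))
```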
Heterogeneous Multi-processor Coherent Interconnect
Abstract
The rapid increase in processor and memory integration onto a single die continues to place increasingly complex demands on the interconnect network. In addition
to providing low latency, high speed and high bandwidth access from all processors to all shared resources, the burdens of hardware cache coherence and resource
virtualization are being placed upon the interconnect as well. This paper describes a multi-core shared memory controller interconnect (MSMC) which supports up to 12
processors, 8 independent banks of IO-coherent on-chip shared SRAM, an IO-coherent external memory controller, and high-bandwidth IO connections to the SoC infrastructure.
MSMC also provides basic IO address translation and memory protection for the on-chip shared SRAM and external memory as well as soft error (SER) protection with hardware
scrubbing for the on-chip memory. MSMC formed the heart of the compute cluster for a 28-nm CMOS device including 8 Texas Instruments C66x DSP processors and 4 cache-coherent
ARM A15 processors sharing 6 MB of on-chip SRAM running at 1.3 GHz. At this speed, MSMC provides all connected masters a combined read/write bandwidth of nearly 1 TB/s for access
to the on-chip shared SRAM, and a combined read/write bandwidth of 457.6 GB/s to all shared resources, within an area of 16 mm^2.
K. Chirca, M. Pierson, J. Zbiciak, D. Thompson, D. Wu, S. Myilswamy, R. Griesmer, K. Basavaraj, T. Huynh, A. Dayal,
J. You, P. Eyres, Y. Ghadiali, T. Beck, A. Hill, N. Bhoria, D. Bui, J. Tran, M. Rahman, H. Fei, S. Jagathesan and T. Anderson
|
12:00-13:00 |
Lunch |
13:00-14:30 |
TCP/IP |
Session Chair: Rami Melhem
|
TCP Pacing in Data Center Networks
Abstract
This paper studies the effectiveness of TCP pacing in data center networks. TCP
senders inject bursts of packets into the network at the beginning of each round-trip
time. These bursts stress the network queues which may cause loss, reduction in
throughput and increased latency. Such undesirable effects become more pronounced in
data center environments where traffic is bursty in nature and buffer sizes are
small. TCP pacing is believed to reduce the burstiness of TCP traffic and to mitigate
the impact of small buffering in routers. Unfortunately, current research literature
has not always agreed on the overall benefits of pacing. In this paper, we present a
model for the effectiveness of pacing. Our model demonstrates that for a given buffer
size, as the number of concurrent flows is increased beyond a Point of
Inflection (PoI), non-paced TCP outperforms paced TCP. We present lower and upper
bounds for the PoI and argue that increasing the number of concurrent flows beyond the
PoI increases inter-flow burstiness of paced packets and reduces the effectiveness of
pacing. We validate our model using a novel and practical implementation of paced TCP in the
Linux kernel and perform several experiments in a test-bed.
M. Ghobadi and Y. Ganjali
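The core idea of pacing can be stated in a few lines. The sketch below is illustrative only (it is not the authors' Linux kernel implementation) and contrasts the transmit times of a back-to-back burst with those of a paced window spaced RTT/cwnd apart.

```python
# Minimal sketch of TCP pacing (illustrative only): instead of sending the
# whole congestion window back-to-back at the start of an RTT, a paced sender
# spaces packets RTT / cwnd apart so they arrive at the bottleneck queue as a
# smooth stream rather than a burst.

def send_times(cwnd, rtt, paced):
    """Return the transmit times (seconds) of one window's packets."""
    if paced:
        gap = rtt / cwnd
        return [i * gap for i in range(cwnd)]
    return [0.0] * cwnd          # un-paced: the whole window leaves as one burst

rtt = 0.0001                      # 100 us, a plausible data-center RTT (assumed)
print("burst:", send_times(4, rtt, paced=False))
print("paced:", send_times(4, rtt, paced=True))
```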
Clustered Linked List Forest for IPv6 Lookup
Abstract
Providing a high operating frequency and abundant parallelism, Field Programmable Gate Arrays (FPGAs) are the most promising platform to implement SRAM-based
pipelined architectures for high-speed Internet Protocol (IP) lookup. Owing to the restrictions of state-of-the-art FPGAs on the number of I/O pins and on-chip memory,
the existing approaches can hardly accommodate the large and sparsely distributed IPv6 routing tables. Therefore, memory-efficient data structures are in high demand.
In this paper, a clustered linked list forest (CLLF) data structure is proposed for solving the longest prefix matching (LPM) problem in IP lookup. Our structure, comprising multiple
parallel linked lists of prefix nodes, achieves significant memory compaction in comparison to existing approaches. The CLLF data structure is implemented on a high-throughput, SRAM-based,
parallel and pipelined FPGA architecture. Utilizing a state-of-the-art FPGA device, the CLLF architecture can accommodate up to 686K IPv6 prefixes while supporting fast incremental routing
table updates.
O. Erdem and A. Carus
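To make the longest-prefix-matching problem concrete, here is a generic linear-scan LPM sketch using Python's ipaddress module with a made-up three-entry table; it conveys none of CLLF's linked-list organization or FPGA pipelining, which are the paper's contribution.

```python
# Generic longest-prefix match (LPM) sketch, included only to make the problem
# concrete; the routing table entries below are invented examples.

import ipaddress

table = [
    (ipaddress.ip_network("2001:db8::/32"), "port 1"),
    (ipaddress.ip_network("2001:db8:aa00::/40"), "port 2"),
    (ipaddress.ip_network("::/0"), "default"),
]

def lookup(addr):
    """Return the next hop of the longest matching prefix, or None."""
    dst = ipaddress.ip_address(addr)
    best = None
    for net, nexthop in table:
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, nexthop)
    return best[1] if best else None

print(lookup("2001:db8:aaff::1"))   # matches the /40 -> "port 2"
print(lookup("2001:db8:1::1"))      # matches the /32 -> "port 1"
```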
HybridCuts: A Scheme Combining Decomposition and Cutting for Packet Classification
Abstract
Packet classification is an enabling function for a variety of Internet applications such as access control, quality of service and differentiated services. Decision-tree and decomposition are the most
well-known algorithmic approaches. Compared to architectural solutions, both approaches are memory and performance inefficient, falling short of the needs of high-speed networks. EffiCuts, the state-of-the-art
decision-tree technique, significantly reduces memory overhead of classic cutting algorithms with separated trees and equi-dense cuts. However, it suffers from too many memory accesses and a large number of
separated trees. Moreover, EffiCuts needs comparator circuitry to support equi-dense cuts, which makes it less practical. Decomposition-based schemes, such as BV, can leverage the parallelism offered by modern
hardware for memory accesses, but they have poor storage scalability. In this paper, we propose HybridCuts, a combination of decomposition and decision-tree techniques that improves storage and performance simultaneously.
The decomposition part of HybridCuts has the benefits of traditional decomposition-based techniques without the trouble of aggregating results from a large number of bit vectors or a set of big lookup tables. Meanwhile,
thanks to the clever partitioning of the rule set, an efficient cutting algorithm following the decomposition can build short decision trees with a significant reduction in rule replication. Using ClassBench, we show that
HybridCuts achieves memory reduction similar to EffiCuts but significantly outperforms EffiCuts in terms of memory accesses for packet classification. In addition, HybridCuts is more practical for implementation
than EffiCuts, which maintains complicated data structures, takes a huge amount of time for tree merging, and requires special hardware support for efficient cuts.
W. Li and X. Li
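The decomposition idea can be illustrated compactly: split the rules into subsets, search the subsets independently (in hardware these searches run in parallel), and keep the highest-priority match. The sketch below groups rules by the dimension in which each is narrowest, uses an invented two-field rule format, and, unlike HybridCuts, builds no decision trees over the subsets.

```python
# Highly simplified illustration of rule-set decomposition (not HybridCuts itself).
# rule = (priority, [(lo, hi) per field], action); packet = [value per field].
rules = [
    (0, [(0, 100), (80, 80)], "web"),
    (1, [(50, 50), (0, 65535)], "host50"),
    (2, [(0, 65535), (0, 65535)], "default"),
]

def narrowest_dim(rule):
    """Dimension in which this rule covers the smallest range."""
    return min(range(len(rule[1])), key=lambda d: rule[1][d][1] - rule[1][d][0])

subsets = {}
for r in rules:
    subsets.setdefault(narrowest_dim(r), []).append(r)

def classify(packet):
    # Each subset is searched independently (in hardware: in parallel);
    # the lowest priority value among all matches wins.
    matches = [r for group in subsets.values() for r in group
               if all(lo <= v <= hi for v, (lo, hi) in zip(packet, r[1]))]
    return min(matches, key=lambda r: r[0])[2] if matches else None

print(classify([50, 80]))   # "web" wins on priority over "host50"
print(classify([50, 22]))   # "host50"
```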
|
14:30-15:00 |
Afternoon Break |
15:00-15:30 |
|
Session Chair: Cyriel Minkenberg
|
Architecture and Performance of the Tilera TILE-Gx8072 Manycore Processor
Abstract
This talk describes the Tilera TILE-Gx processor architecture, discusses the design choices, and presents performance results on representative applications for the 72-core
TILE-Gx72™, the flagship processor in Tilera's TILE-Gx™ family. This processor family comprises a series of high-performance, low-power 64-bit manycore processor SoCs,
tightly coupled with high performance packet processing. These highly integrated processors deliver exceptional performance and performance-per-watt in the embedded networking,
cyber security, and high throughput computing markets. Of particular interest is the iMesh network-on-chip, which scales to 100s of cores and provides high-speed interconnection
of all on-die elements and cache coherence across the chip.
Matthew Mattina, CTO, Tilera
Bio
Mr. Mattina is the Chief Technology Officer at Tilera and is responsible for processor strategy and technology. As processor architect at Tilera, he co-led the design of the 64-core TILE-Pro, and the 9- to
72-core TILE-Gx processor families. Prior to Tilera, Mr. Mattina was with Intel Corporation where he was co-lead architect for the Tukwila Multicore Processor, supervising a team of architects and designers.
At Intel, Mr. Mattina invented and designed the Intel Ring Uncore Architecture, used across Intel's x86 multicore processor designs. This technology won the Intel Achievement Award in 2010. Prior to Intel, he was an
architect and circuit design engineer at Digital Equipment Corporation, working on the Alpha EV7 and EV8 processors. Mr. Mattina also served as Technical Leader at Cisco Systems in the TelePresence Infrastructure Business Unit,
where he contributed to the hardware and software design of next-generation high-definition video conferencing products. He has been granted over 20 patents and has published journal and conference papers relating to CPU design,
multicore processors, and cache coherence protocols. Mr. Mattina holds a BS in Computer and Systems Engineering from Rensselaer Polytechnic Institute and an MS in Electrical Engineering from Princeton University.
|
15:30-16:00 |
Overview and Next Steps for the Open Compute Project
Abstract
Billions of people and their many devices will be coming online in the next decade, and those who are already online are living ever-more connected
lives. The industry is building out a huge physical infrastructure to support this growth, but we are doing so in a largely closed fashion, inhibiting
the pace of innovation and preventing us from achieving the kinds of efficiencies that might otherwise be possible.
In this talk, John Kenevey will provide an overview of the Open Compute Project, a thriving consumer-led community dedicated to promoting more
openness and a greater focus on scale, efficiency, and sustainability in the development of infrastructure technologies. John will give a brief history
of the project and describe its vision for the future, focusing on a new project within OCP to develop an open network switch.
John Kenevey, Facebook & the Open Compute Project
Bio
John has 18 years of experience in the technology sector, spanning startups to small, medium and large cap technology companies. In 2011 John initiated, orchestrated
and founded the Open Compute Project. Two years into the project, OCP has gained traction across the supplier ecosystem with the likes of HP, Dell, AMD and Intel joining and
contributing to the project. Currently, John has shifted his focus to building out an OCP incubation channel to exploit the growing opportunity that the Open Compute Project
has provided. John advises several startups in Silicon Valley. He holds a Master's degree in Economics from University College Dublin.
|
16:00-17:00 |
Keynote II |
Session Chair: Torsten Hoefler
|
Networking as a Service
Tom Anderson, University of Washington
|
17:15-18:00 |
Keynote III |
Session Chair: Christos Kolias
|
The Network is the Cloud
David Yen, SVP/GM Data Center Group, Cisco
|
18:00-19:00 |
Head Bubba Memorial Cocktail Reception |
Time |
Thursday, August 22 (Symposium) |
8:00-9:00 |
Breakfast and Registration |
9:00-10:00 |
Keynote IV |
Session Chair: Madeleine Glick
|
Hybrid Datacenter Networks
George Papen, University of California, San Diego
|
10:10-10:30 |
Morning Break |
10:30-12:00 |
OpenFlow and High-Performance Computing |
Session Chair: Mohammad Alizadeh
|
Efficient Security Applications Implementation in OpenFlow Controller with FleXam
Abstract
Current OpenFlow specifications provide limited access to packet-level information such as packet content, making it very inefficient, if not impossible, to deploy security
and monitoring applications as controller applications. In this paper, we propose FleXam, a flexible sampling extension for OpenFlow designed to provide access to packet level
information at the controller.
The simplicity of FleXam makes it easy to implement in OpenFlow switches and to operate at line rate without requiring any additional memory. At the same time, its
flexibility allows various monitoring and security applications to be implemented in the controller while maintaining a balance between overhead and the level of detail of the collected information.
FleXam realizes the advantages of both proactive and reactive routing schemes by providing a tunable trade-off between the visibility of individual flows and the controller load.
As an example, we demonstrate how FleXam can be used to implement a port scan detection application with an extremely low overhead.
S. Shirali-Shahreza and Y. Ganjali
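The kind of controller application the paper targets can be sketched as follows: per-packet samples delivered by a sampling extension such as FleXam feed a simple port-scan detector. The sampling probability, threshold, and message fields here are invented for illustration and are not part of the FleXam design.

```python
# Toy port-scan detector driven by sampled packets (illustrative only; the
# sampling rate, threshold, and fields are assumptions, not FleXam's spec).

import random
from collections import defaultdict

SAMPLE_PROB = 0.05          # assumed sampling probability
PORT_THRESHOLD = 20         # distinct sampled ports before a source is flagged

ports_seen = defaultdict(set)

def on_packet(src_ip, dst_port):
    """Switch-side sampling decision + controller-side detection, collapsed for brevity."""
    if random.random() >= SAMPLE_PROB:
        return None                          # not sampled: no controller load
    ports_seen[src_ip].add(dst_port)
    if len(ports_seen[src_ip]) > PORT_THRESHOLD:
        return f"possible port scan from {src_ip}"
    return None

random.seed(1)
for port in range(1, 1000):                  # a scanner sweeping ports
    alert = on_packet("10.0.0.99", port)
    if alert:
        print(alert)
        break
```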
OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management
Abstract
Dragonfly networks are appealing topologies for large-scale datacenter and HPC networks that provide high throughput with a low diameter and moderate cost. However, they are prone to congestion under certain
frequent traffic patterns that saturate specific network links. Adaptive nonminimal routing can be used to avoid such congestion. That kind of routing employs longer paths to circumvent local or global congested
links. However, if a distance-based deadlock avoidance mechanism is employed, more Virtual Channels (VCs) are required, which increases design complexity and cost. OFAR (On-the-Fly Adaptive Routing) is a routing proposal
that decouples virtual channels from deadlock avoidance, making local and global misrouting affordable. However, the severity of congestion with OFAR is higher, because it relies on an escape network with low bisection bandwidth.
Additionally, OFAR allows for unlimited misrouting on the escape subnetwork, leading to unbounded paths in the network and long latencies. In this paper, we propose and evaluate OFAR-CM, a variant of OFAR combined with a simple
congestion management (CM) mechanism which only relies on local information, specifically the credit count of the output ports in the local router. With simple escape networks such as a Hamiltonian ring or a tree, OFAR outperforms
former proposals with distance-based deadlock avoidance. Additionally, although long paths are allowed in theory, in practice packets arrive at their destination in a small number of hops. Altogether, OFAR-CM constitutes the first
practicable mechanism to date supporting both local and global misrouting in Dragonfly networks.
M. Garcia, E. Vallejo, R. Beivide, M. Valero and G. Rodriguez
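A much-simplified decision rule in the spirit of OFAR-CM's local congestion management is sketched below: the router consults only its own credit counts and misroutes when the minimal output port looks congested. The thresholds and the two-port interface are assumptions for illustration, not the paper's mechanism.

```python
# Simplified credit-based misrouting decision (illustration only).

LOCAL_THRESHOLD = 4     # assumed credit thresholds, not from the paper
GLOBAL_THRESHOLD = 8

def choose_port(minimal_port, nonminimal_port, credits):
    """credits: dict mapping port -> free credits at this router."""
    if credits[minimal_port] >= LOCAL_THRESHOLD:
        return minimal_port                  # minimal path looks uncongested
    if credits[nonminimal_port] >= GLOBAL_THRESHOLD:
        return nonminimal_port               # misroute around the congestion
    return minimal_port                      # both congested: stay minimal

print(choose_port("min", "mis", {"min": 10, "mis": 12}))   # -> "min"
print(choose_port("min", "mis", {"min": 1,  "mis": 12}))   # -> "mis"
```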
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Abstract
The emergence of co-processors such as the Intel Many Integrated Core (MIC) is changing the landscape of supercomputing. The MIC is a memory-constrained environment and its processors also operate at slower clock rates. Further, the
communication characteristics between MIC processes are also different compared to communication between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication
performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally re-design
collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI Bcast, MPI Reduce
and MPI Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the
performance of collectives on MIC clusters. Our designs improve the latency of the MPI Bcast operation with 4,864 MPI processes by up to 76%. We also observe up to 52.4% improvements in the communication latency of the MPI Allreduce
operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.
K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri and D. K. Panda
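As background, the sketch below shows a generic two-level (leader-based) broadcast with mpi4py: ranks are split into a host group and a coprocessor group, the leaders exchange the data, and each leader then broadcasts within its group. The `is_mic` flag and the group assignment are assumptions for illustration; the paper's MIC-aware designs and tuning go well beyond this.

```python
# Generic hierarchy-aware broadcast sketch (not the paper's design).
from mpi4py import MPI

def hierarchical_bcast(data, comm, is_mic):
    # Split ranks into a host group and a coprocessor (MIC) group.
    group = comm.Split(color=1 if is_mic else 0, key=comm.Get_rank())
    # Local rank 0 of each group acts as its leader; leaders form their own communicator.
    leaders = comm.Split(color=0 if group.Get_rank() == 0 else MPI.UNDEFINED,
                         key=comm.Get_rank())
    if leaders != MPI.COMM_NULL:
        data = leaders.bcast(data, root=0)   # exchange data among leaders first
    return group.bcast(data, root=0)         # then broadcast within each group

comm = MPI.COMM_WORLD
payload = "hello" if comm.Get_rank() == 0 else None
result = hierarchical_bcast(payload, comm, is_mic=(comm.Get_rank() % 2 == 1))
print(comm.Get_rank(), result)
```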
|
12:00-13:00 |
Lunch |
13:00-14:30 |
Short Papers |
Session Chair: Patrick Geoffray
|
On the Data Path Performance of Leaf-Spine Datacenter Fabrics
Abstract
Modern datacenter networks must support a multitude of diverse and demanding workloads at low cost,
and even the simplest architectural choices can impact mission-critical application performance.
This forces network architects to continually evaluate tradeoffs between ideal designs and
pragmatic, cost effective solutions. In real commercial environments the number of parameters
that the architect can control is fairly limited and typically includes only the choice of topology,
link speeds, oversubscription, and switch buffer sizes. In this paper we provide some guidance to
the network architect about the impact these choices have on data path performance.
We analyze the behavior of Leaf-Spine topologies under realistic traffic workloads via extensive
simulations and identify what is important for performance and what is not. We present
intuitive arguments that explain our findings and provide a framework for reasoning about different
design tradeoffs.
M. Alizadeh and T. Edsall
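One of the knobs the paper sweeps, oversubscription, reduces to simple arithmetic; the helper below is illustrative only (the paper's results come from packet-level simulation) and computes the leaf oversubscription ratio from host-facing and spine-facing bandwidth.

```python
# Back-of-the-envelope helper for one Leaf-Spine design parameter (illustrative).

def leaf_oversubscription(hosts_per_leaf, host_link_gbps, num_spines, uplink_gbps):
    downlink = hosts_per_leaf * host_link_gbps       # bandwidth toward servers
    uplink = num_spines * uplink_gbps                # bandwidth toward spines
    return downlink / uplink

# e.g. 24 x 10G hosts per leaf, 4 x 40G uplinks -> 1.5:1 oversubscribed (assumed numbers)
print(leaf_oversubscription(24, 10, 4, 40))
```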
Can Parallel Replication Benefit HDFS for High-Performance Interconnects?
Abstract
The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications. HDFS has been adopted as the underlying file system of numerous data-intensive applications due to
its reliability and fault-tolerance. HDFS provides fault-tolerance and availability guarantee by replicating each data block to three (default replication factor) DataNodes. The current implementation
of HDFS in Apache Hadoop supports pipelined replication, which introduces increased latency for real-time, latency-sensitive applications. In this paper, we introduce an alternative parallel replication
scheme in both the socket-based and the RDMA-based designs of HDFS over InfiniBand. Parallel replication allows the client to write all the replicas in parallel. We have analysed the challenges and issues of parallel
replication and compared its performance with pipelined replication. With modern high performance networks, parallel replication can offer much better response times for latency-sensitive applications compared to
that of pipelined replication. Experimental results show that parallel replication can reduce the execution time of the TeraGen benchmark by up to 16% over IPoIB (IP over InfiniBand), 10GigE and RDMA (Remote Direct
Memory Access) over InfiniBand. The throughput of the TestDFSIO benchmark is also increased by 12% over high performance interconnects like IPoIB, 10GigE and RDMA over InfiniBand by parallel replication. It can also
enhance HBase Put operation performance by 17% for the above-mentioned interconnects and protocols. However, for throughput over networks like 1GigE, and for smaller data sizes, parallel replication does not
benefit performance.
N. Islam, X. Lu, M. Rahman and D. K. Panda
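The difference between the two replication schemes can be seen in a few lines of Python (a conceptual sketch, not HDFS code): pipelined replication chains the transfers, so the client's latency grows with the replication factor, while parallel replication overlaps them.

```python
# Conceptual sketch of pipelined vs. parallel replica writes (not HDFS code).
# Crude model: each replica transfer is a fixed, non-overlapped delay; real
# HDFS pipelining streams packets along the chain, so the gap is smaller.

import time
from concurrent.futures import ThreadPoolExecutor

def write_replica(node, delay_s=0.1):
    time.sleep(delay_s)                 # stand-in for one block transfer
    return node

def pipelined(nodes):
    for node in nodes:                  # client -> DN1 -> DN2 -> DN3
        write_replica(node)

def parallel(nodes):
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        list(pool.map(write_replica, nodes))

for scheme in (pipelined, parallel):
    start = time.time()
    scheme(["dn1", "dn2", "dn3"])
    print(scheme.__name__, round(time.time() - start, 2), "s")
```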
Interconnect for Tightly Coupled Accelerators Architecture
Abstract
In recent years, heterogeneous clusters using accelerators have been widely used for high-performance computing systems. In such clusters, inter-node communication among accelerators requires
several memory copies via CPU memory, and the communication latency causes severe performance degradation. To address this problem, we propose the Tightly Coupled Accelerators (TCA) architecture
to reduce the communication latency between accelerators in different nodes. In the TCA architecture, PCI Express packets are used directly for communication among accelerators across nodes.
In addition, we designed a communication chip, named PEACH2, to realize the TCA architecture. In this paper, we introduce the design and implementation of the PEACH2 chip using an FPGA and present
the PEACH2 board as a PCI Express extension board. A GPU cluster with several tens of nodes based on the TCA architecture will be installed in our center, and this system will demonstrate the
effectiveness of the TCA architecture.
T. Hanawa, Y. Kodama, T. Boku and M. Sato
Low Latency Scheduling Algorithm for Shared Memory Communication over Optical Networks
Abstract
Optical Networks-on-Chip (NoCs) based on silicon photonics have been proposed to reduce latency and power consumption in future chip multi-core processors (CMPs). However, high-performance CMPs
use a shared memory model which generates large numbers of short messages, typically of the order of 8-256 B. Messages of this length create high overhead for optical switching systems due to arbitration
and switching times. Current schemes only start the arbitration process when the message arrives at the input buffer of the network. In this paper, we propose a scheme which intelligently uses the information
from the memory controllers to schedule optical paths. We identified predictable patterns of messages associated with memory operations for a 32-core x86 system using the MESI coherency protocol. We used the
first message of each pattern to open the optical paths which will be used by all subsequent messages, thereby eliminating arbitration time for the latter. Without considering the initial request message, this
scheme can therefore reduce the time of flight of a data message in the network by 29% and that of a control message by 67%. We demonstrate the benefits of this scheduling algorithm for applications in the PARSEC
benchmark suite with overall average reductions, in terms of the overhead latency per message, of 31.8% for the streamcluster benchmark and up to 70.6% for the swaptions benchmark.
M. Madarbux, P. Watts and A. Van Laer
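The scheduling idea admits a compact illustration: the first message of a known coherence transaction is used to pre-open the optical path that the later reply will need, so the reply skips arbitration. The message types, pattern table, and interface below are invented for illustration and are not the paper's implementation.

```python
# Toy illustration of pattern-driven optical path pre-scheduling (assumed
# message names and pattern table, not the paper's protocol).

expected_reply = {"GetS": "DataReply", "GetM": "DataReply"}   # assumed MESI-style patterns
open_paths = set()

def on_message(msg_type, src, dst):
    if msg_type in expected_reply:
        # The request travels src -> dst; pre-open the reverse path for the data reply.
        open_paths.add((dst, src))
        return "arbitrated"                       # the request itself still pays arbitration
    if (src, dst) in open_paths:
        open_paths.discard((src, dst))
        return "pre-opened (no arbitration)"
    return "arbitrated"

print(on_message("GetS", src=3, dst=12))          # request from core 3 to directory 12
print(on_message("DataReply", src=12, dst=3))     # reply rides the pre-opened path
```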
Bursting Data between Data Centers: Case for Transport SDN
Abstract
Public and Private Enterprise clouds are changing the nature of WAN data center interconnects. Datacenter WAN interconnects today are pre-allocated, static optical trunks of high capacity. These
optical pipes carry aggregated packet traffic originating from within the datacenters while routing decisions are made by devices at the datacenter edges. In this paper, we propose a software-defined
networking enabled optical transport architecture (Transport SDN) that meshes seamlessly with the deployment of SDN within the Data Centers. The proposed programmable architecture abstracts a core transport node
into a programmable virtual switch that leverages the OpenFlow protocol for control. A demonstration use-case of an OpenFlow-enabled optical virtual switch managing a small optical transport network for a big-data
application is described. With appropriate extensions to OpenFlow, we discuss how the programmability and flexibility that SDN brings to packet-optical datacenter interconnects will be instrumental in solving some of the
complex multi-vendor, multi-layer, multi-domain issues that hybrid cloud providers face.
A. Sadasivarao, S. Syed, P. Pan, C. Liou, A. Lake, C. Guok and I. Monga
|
14:30-15:00 |
Afternoon Break |
15:00-16:00 |
Keynote V |
Session Chair: Christos Kolias
|
Changing Data Center
Tom Edsall, CTO at Insieme Networks
|
16:00-17:30 |
Evening Panel |
Moderator: Mitch Gusat
|
Data-Center Congestion Control: the Neglected Problem
Abstract
After decades of TCP and RED/ECN, and extensive experience with Valiant and hash/ECMP-based load balancing, what have we learned thus far?
The panel will debate the pros and cons of flow and congestion controls in future DC and HPC networks, considering the specifics of each layer, from the L2 link level
up to L4 transports and L5 applications. The panelists will argue the balance between hardware and software solutions, their timescales, costs, and expected real-life
impact.
Also considered will be the related issues of HOL blocking in various multihop topologies, load balancing, adaptive routing, global scheduling, OpenFlow options,
new DC-TCP versions, and application-level changes. And an intriguing new challenge: what about SDN, aka OVN or the virtual DCN?
Mohammad Alizadeh, Insieme Networks/Stanford
Claudio DeSanti, Cisco
Mehul Dholakia, Brocade
Tom Edsall, Insieme Networks
Bruce Kwan, Broadcom
Gilad Shainer, Mellanox
|
17:30-17:45 |
Closing Remarks |
|
|