SC17 Papers


The following papers have been accepted for the INDIS 2017 workshop program. (View the 2017 workshop program.)

Qiao Xiang, Xin Wang, Jingxuan Zhang, Harvey Newman, Y. Richard Yang and Yang Liu. "Unicorn: Unified Resource Orchestration for Multi-Domain, Geo-Distributed Data Analytics"

Abstract: As the data volume increases exponentially over time, data-intensive analytics benefits substantially from multi-organizational, geographically-distributed, collaborative computing, where different organizations contribute various yet scarce resources, e.g., computation, storage and networking resources, to collaboratively collect, share and analyze extremely large amounts of data. By analyzing the data analytics trace from the Compact Muon Solenoid (CMS) experiment, one of the largest scientific experiments in the world, and systematically examining the design of existing resource management systems for clusters, we show that the multi-domain, geo-distributed, resource-disaggregated nature of this new paradigm calls for a framework to manage a large set of distributively-owned, heterogeneous resources, with the objective of efficient resource utilization, following the autonomy and privacy of different domains, and that the fundamental challenge for designing such a framework is: how to accurately discover and represent resource availability of a large set of distributively-owned, heterogeneous resources across different domains with minimal information exposure from each domain? Existing resource management systems are designed for single-domain clusters and cannot address this challenge. In this paper, we design Unicorn, the first unified resource orchestration framework for multi-domain, geo-distributed data analytics. In Unicorn, we encode the resource availability for each domain into resource state abstraction, a variant of the network view abstraction extended to accurately represent the availability of multiple resources with minimal information exposure using a set of linear inequalities. We then design a novel, efficient cross-domain query algorithm to discover and integrate the accurate, minimal resource availability information for a set of data analytics jobs across different domains. 
In addition, Unicorn contains a global resource orchestrator that computes optimal resource allocation decisions for data analytics jobs. We discuss the implementation of Unicorn and present preliminary evaluation results to demonstrate its efficiency and efficacy. We will also give a full demonstration of the Unicorn system at SuperComputing 2017.

Joaquin Chung, Rajkumar Kettimuthu, Nam Pho, Russ Clark and Henry Owen. "Orchestrating Intercontinental Advance Reservations with Software-Defined Exchanges"

Abstract: To interconnect research facilities across wide geographic areas, network operators deploy science networks, also referred to as Research and Education (R&E) networks. These networks allow experimenters to establish dedicated circuits between research facilities for transferring large amounts of data, by using advance reservation systems. Intercontinental dedicated circuits typically require coordination between multiple administrative domains, which need to reach an agreement on a suitable advance reservation. In traditional systems, the success rate of finding an advance reservation decreases as the number of participating domains increases, because the circuit is composed over a single path. To improve provisioning of multi-domain advance reservations, we propose an architecture for end-to-end service orchestration in multi-domain science networks that leverages software-defined exchanges (SDX) to provide multi-path, multi-domain advance reservations. We have implemented an orchestrator for multi-path, multi-domain advance reservations and an SDX to support these services. Our orchestration architecture improves the reservation success rate from 50% in single-path systems to 99% when four paths are available.

Anshuman Chabbra and Mariam Kiran. "Classifying Elephant and Mice Flows in High-Speed Scientific Networks"

Abstract: Complex science workflows usually involve very large data demands and resource-intensive computations. These demands need a reliable high-speed network that continually optimizes its performance for the application data flows. Characterizing these flows into large flows (elephants) versus small flows (mice) can allow networks to optimize their performance by detecting and handling these demands in real time. However, predicting elephant versus mice flows is extremely difficult, as their definition varies across networks.

Machine learning techniques can help classify flows into two distinct clusters to identify characteristics of transfers. In this paper, we investigate unsupervised and semi-supervised machine learning approaches to classify flows in real time. We combine a Gaussian Mixture Model with an initialization algorithm to obtain a novel general-purpose method that aids classification based on network sites (in terms of data transfers, flow rates and durations). Our results show that, despite variable flows at each site, the proposed algorithm is able to cluster elephants and mice with an accuracy rate of 90%. We analyzed one month of NetFlow reports from three ESnet site routers to train the model and predict clusters.
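As a rough illustration of the clustering idea in this abstract, the sketch below fits a two-component Gaussian mixture to the logarithm of flow sizes using plain expectation-maximization and labels the component with the larger mean as "elephant". It is not the authors' method: the function names, the choice to seed the two means at the smallest and largest observation, the use of flow size alone as the feature, and the sample flow sizes are all assumptions made for this example.

```python
import math

def fit_two_component_gmm(values, iterations=50):
    """Fit a 1-D, two-component Gaussian mixture with plain EM.

    Hypothetical initialization (an assumption, not the paper's algorithm):
    seed the two means at the smallest and largest observation.
    """
    means = [min(values), max(values)]
    spread = (max(values) - min(values)) or 1.0
    variances = [spread ** 2 / 4.0, spread ** 2 / 4.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in values:
            dens = [
                weights[k]
                * math.exp(-((x - means[k]) ** 2) / (2.0 * variances[k]))
                / math.sqrt(2.0 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0.0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[k] = nk / len(values)
    return means, resp

def label_flows(flow_bytes):
    """Label each flow 'elephant' or 'mouse' from its size in bytes."""
    logs = [math.log10(b) for b in flow_bytes]
    means, resp = fit_two_component_gmm(logs)
    elephant = 0 if means[0] > means[1] else 1
    return ["elephant" if r[elephant] > 0.5 else "mouse" for r in resp]

# Invented sample flow sizes in bytes, for illustration only.
sample_flows = [2e3, 5e3, 1.2e4, 8e3, 3e4, 4e9, 1.5e10, 9e9, 2.5e10]
labels = label_flows(sample_flows)
for size, label in zip(sample_flows, labels):
    print(f"{size:>16,.0f} bytes -> {label}")
```

Working on the logarithm of the flow size keeps the two groups at comparable scales; a real classifier along the lines of the abstract would also use flow rates and durations.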

Ralph Koning, Ben de Graaff, Robert Meijer, Cees de Laat and Paola Grosso. "Measuring the efficiency of SDN mitigations against cyber attacks"

Abstract: To address increasing problems caused by cyber attacks, we leverage Software Defined Networks and Network Function Virtualisation governed by a SARNET-agent to enable autonomous response and attack mitigation. A Secure Autonomous Response Network (SARNET) uses a control loop to constantly assess the security state of the network by means of observables. Using a prototype, we introduce the metrics impact and efficiency and show how they can be used to compare and evaluate countermeasures. These metrics become building blocks for self-learning SARNETs that exhibit true autonomous response.

Jordi Ros-Giralt. "Algorithms and Data Structures to Accelerate Network Analysis"

Abstract: As the sheer amount of computer-generated data continues to grow exponentially, new bottlenecks are unveiled that require rethinking our traditional software and hardware architectures. In this paper we present five algorithms and data structures (long queue emulation, lockless bimodal queues, tail early dropping, LFN tables, and multiresolution priority queues) designed to optimize the process of analyzing network traffic. We integrated these optimizations into R-Scope, a high-performance network appliance that runs the Bro network analyzer, and present benchmarks showcasing performance speedups of 5X at traffic rates of 10 Gbps.

Chen Xu, Peilong Li and Yan Luo. "A Programmable Policy Engine to Facilitate Time-efficient Science DMZ Management"

Abstract: The Science DMZ model employs dedicated network infrastructures and advanced software techniques for large-volume scientific research traffic flows targeting high-throughput and low-latency data transfer. However, the current Science DMZ framework lacks an efficient means of expressing user intent and suffers from slow service delivery due to the manual work involved in the management loop. As a result, a programmable interface that facilitates user-administrator communication in a time-efficient manner is in high demand. In this paper, we introduce FLowell, an enhanced SDN-powered Science DMZ model deployed on our campus network. Moreover, we propose a programmable policy engine atop the SDN controller that allows network administrators to implement configuration policies in order to manage the network, while simultaneously offering end users network resource request policies with rapid response times. Our experimental results show that FLowell can respond to and service user intent within 1 second. In addition, FLowell reduces the network latency for the research network path by 35% and boosts the disk-to-disk throughput up to the 10 Gbps line rate.

Eric Pouyoul, Mariam Kiran, Nathan Hanford, Dipak Ghosal, Fatemah Alali, Raj Kettimuthu and Ben Mackcrane. "Calibers: A Bandwidth Calendaring Paradigm For Science Workflows"

Abstract: Many scientific workflows require large data transfers between distributed instrument facilities, storage and computing resources. To ensure that these resources are maximally utilized, the R&E networks connecting them must ensure that there is no bottleneck. However, running the network at high utilization often results in congestion and poor end-to-end TCP throughput performance and/or fairness. This in turn leads to unpredictability in transfer time and poor utilization of distributed resources. Calibers (Calendar and Large-scale Bandwidth Event-driven Simulations) aims to advance the state of the art in traffic engineering by leveraging SDN-based network architecture and flow pacing algorithms to provide predictable data transfer performance and higher network utilization. Calibers highlights how, by intelligently and dynamically shaping flows, we can maximize the number of flows that meet their deadlines while improving network resource utilization. In this paper, we present a prototype architecture for Calibers that uses a central controller with distributed agents to dynamically pace flows at the ingress of the network to meet deadlines. Using Globus/Grid-FTP, we experimentally demonstrate that pacing can be used to meet data transfer deadlines which cannot be achieved using TCP. Finally, we present dynamic flow pacing algorithms that maximize the acceptance ratio of flows for which deadlines can be met while maximizing network utilization. Our results show that simple heuristics that optimize locally on the most bottlenecked link can perform almost as well as heuristics that attempt to optimize globally.

Shilpi Bhattacharyya, Dimitrios Katramatos and Shinjae Yoo. "Why wait? Let's start computing while data is still on the wire"

Abstract: In this era of Big Data, computing useful information from data is becoming increasingly complicated, particularly due to the ever-increasing volumes of data that need to travel over the network to data centers to be stored and processed, all highly expensive operations in the long haul. In this paper we suggest that we can do computing and analysis of data "on the wire," i.e., while data is still in transit. These computations include analysis, visualization, pattern recognition, and prediction, or forecasting, on the streaming data. We follow a service function chaining architecture to implement this, assuming that the data packets arrive within a single network administrative domain. As a demonstration of this new computing paradigm, we present three examples. First, we demonstrate pattern recognition and data visualization on streaming forex data, which can be used for lucrative trading in the forex market. In our second example, we analyze and learn user buying patterns from clickstream data streaming from multiple websites. Finally, we monitor solar sensors for a zero reading while the packets are still on their way to the data center, to schedule any maintenance and requisite repairs with no time delay.

Muthu Baskaran, David Bruns-Smith, Thomas Henretty, James Ezick and Richard Lethin. "Enhancing Network Visibility and Security through Tensor Analysis"

Abstract: The increasing size, variety, rate of growth and change, and complexity of network data have warranted advanced network analysis and services. Tools that provide automated analysis through traditional or advanced signature-based systems or machine learning classifiers suffer from practical difficulties. These tools fail to provide comprehensive and contextual insights into the network when put to practical use in operational cyber security. In this paper, we present an effective tool for network security and traffic analysis that uses high-performance data analytics based on a class of unsupervised learning algorithms called tensor decompositions. The tool aims to provide a scalable analysis of the network traffic data, reduce the cognitive load of network analysts, and be network-expert-friendly by presenting clear and actionable insights into the network. We demonstrate the successful use of the tool in two very different operational cyber security environments, namely, (1) the security operations center (SOC) for the SCinet network at SC16 - The International Conference for High Performance Computing, Networking, Storage and Analysis and (2) Reservoir Labs’ Local Area Network (LAN). In each of these environments, we produce actionable results for cyber security specialists including (but not limited to) (1) finding malicious network traffic involving internal and external attackers using port scans, SSH brute forcing, and NTP amplification attacks, (2) uncovering obfuscated network threats such as data exfiltration using the DNS port and using ICMP traffic, and (3) finding network misconfiguration and performance degradation patterns.

Lukasz Makowski, Cees de Laat and Paola Grosso. "Evaluation of virtualization and traffic filtering methods for container networks"

Abstract: Future distributed scientific applications will rely on containerisation for data handling and processing. The question is whether container networking, and the associated technologies, are already mature enough to support the level of usability required in these environments. With the work we present in this article, we set out to experiment with and evaluate three novel technologies that support addressing and filtering: EVPN, ILA and Cilium/eBPF. Our evaluation shows different levels of maturity, with EVPN being more suitable for adoption. Our work also indicates that to support true multi-tenancy, further integration of addressing technologies and filtering technologies is needed.

Zhengchun Liu, Rajkumar Kettimuthu, Ian Foster and Peter Beckman. "Towards a Smart Data Transfer Node"

Abstract: Scientific computing systems are becoming significantly more complex, with distributed teams and complex workflows spanning resources from telescopes and light sources to fast networks and smart IoT sensor systems. In such settings, no single, centralized, administrative team and software stack can coordinate and manage all resources used by a single application. Indeed, it appears likely that we have reached a critical limit in manageability using current human-in-the-loop techniques. Instead, we argue that resources must begin to respond autonomically, adapting and tuning their behavior in response to observed properties of scientific workflows. Over time, machine learning methods can be used to identify effective strategies for autonomic, goal-driven management behaviors that can be applied end-to-end across the scientific computing landscape. Using the data transfer nodes that are widely deployed in modern research networks as an example, this paper explores the architecture, methods, and algorithms needed to support future scientific computing systems that self-tune and self-manage.

Rajkumar Kettimuthu, Zhengchun Liu, Ian Foster, Katrin Heitmann, Franck Cappello and David Wheeler. "Transferring a Petabyte in a Day"

Abstract: Extreme-scale simulations and experiments can generate large amounts of data, whose volume can exceed the compute and/or storage capacity at the simulation or experimental facility. With the emergence of ultra-high-speed networks, it becomes feasible to consider pipelined approaches in which data are passed to a remote facility for analysis. Here we examine the case of an extreme-scale cosmology simulation that, when run on a large fraction of a leadership-scale computer, generates data at a rate of one petabyte per elapsed day. Writing those data to disk is inefficient and impractical, and in situ analysis poses its own difficulties. Thus we implement a pipeline in which data are generated on one supercomputer and then transferred, as they are generated, to a remote supercomputer for analysis. We use the Swift scripting language to instantiate this pipeline across Argonne National Laboratory and the National Center for Supercomputing Applications, which are connected by a 100 Gbps network, and demonstrate that by using the Globus transfer service we can achieve a sustained rate of 93 Gbps over a 24-hour period and thus achieve our performance goal of one petabyte moved in 24 hours. This paper describes the methods used and summarizes the lessons learned in this demonstration.
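The relationship between the 93 Gbps sustained rate and the one-petabyte-per-day goal in this abstract can be checked with a short calculation. The sketch below assumes a decimal petabyte (10^15 bytes), which the abstract does not state, and the function names are chosen only for this example.

```python
# Back-of-the-envelope check: what sustained rate moves one petabyte in 24 hours?
# Assumption (not stated in the abstract): 1 PB = 10**15 bytes (decimal petabyte).
PETABYTE_BYTES = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

def required_gbps(total_bytes, seconds):
    """Average rate in gigabits per second needed to move total_bytes in seconds."""
    return total_bytes * 8 / seconds / 1e9

def bytes_moved(rate_gbps, seconds):
    """Bytes transferred at a sustained rate_gbps over the given number of seconds."""
    return rate_gbps * 1e9 / 8 * seconds

needed = required_gbps(PETABYTE_BYTES, SECONDS_PER_DAY)
moved_pb = bytes_moved(93, SECONDS_PER_DAY) / PETABYTE_BYTES

print(f"required rate: {needed:.1f} Gbps")       # required rate: 92.6 Gbps
print(f"moved at 93 Gbps: {moved_pb:.3f} PB")    # moved at 93 Gbps: 1.004 PB
```

Under that assumption, one petabyte per day needs about 92.6 Gbps on average, so a sustained 93 Gbps over 24 hours just clears the goal on a 100 Gbps link.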

Download the archive of INDIS 2017 papers (10MB).