FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation

  • VLDB, Aug. 2024 [PDF][Code][Slides]

Abstract: Data augmentation enhances the accuracy of DL models by diversifying training samples through a sequence of data transformations. While recent advancements in data augmentation have demonstrated remarkable efficacy, they often rely on computationally expensive and dynamic algorithms. Unfortunately, current system optimizations, primarily designed to leverage CPUs, cannot effectively support these methods due to costs and limited resource availability. To address these issues, we introduce FusionFlow, a system that cooperatively utilizes both CPUs and GPUs to accelerate the data preprocessing stage of DL training that runs the data augmentation algorithm. FusionFlow orchestrates data preprocessing tasks across CPUs and GPUs while minimizing interference with GPU-based model training. In doing so, it effectively mitigates the risk of GPU memory overflow by managing memory allocations of the tasks within the GPU-wide free space. Furthermore, FusionFlow provides a dynamic scheduling strategy for tasks with varying computational demands and reallocates compute resources on-the-fly to enhance training throughput for both single and multi-GPU DL jobs. Our evaluations show that FusionFlow outperforms existing CPU-based methods by 16–285% in single-machine scenarios and, to achieve similar training speeds, requires 50–60% fewer CPUs compared to utilizing scalable compute resources from external servers.

Blaze: Holistic Caching for Iterative Data Processing

Abstract:  Modern data processing workloads, such as machine learning and graph processing, involve iterative computations to converge generated models into higher accuracy. An effective caching mechanism is vital to expedite iterative computations since the intermediate data that needs to be stored in memory grows larger over iterations, often exceeding the memory capacity. However, existing systems handle intermediate data through separate operational layers (e.g., caching, eviction, and recovery), with each layer working independently in a greedy or cost-agnostic manner. These layers typically rely on user annotations and past access patterns, failing to make globally optimal decisions for the workload. To overcome these limitations, Blaze introduces a unified caching mechanism that integrates the separate operational layers. Blaze dynamically captures the workload structure and metrics using profiling and inductive regression, and automatically estimates the potential data caching efficiency associated with different operational decisions based on the profiled information. To achieve this goal, Blaze incorporates potential data recovery costs across stages into a single cost optimization function, which informs the optimal partition state and location. This approach reduces the significant disk I/O overheads caused by oversized partitions and the recomputation overheads for partitions with long lineages, while efficiently utilizing the constrained memory space. Our evaluations demonstrate that Blaze can accelerate end-to-end application completion time by 2.02 − 2.52× compared to recomputation-based MEM_ONLY Spark, and by 1.08 − 2.86× compared to checkpoint-based MEM+DISK Spark, while reducing the cache data stored on disk by 95% on average.

Cost-effective On-device Continual Learning over Memory Hierarchy with Miro

Abstract: Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.

Sponge: Fast Reactive Scaling for Stream Processing with Serverless Frameworks

Abstract: Streaming workloads deal with data that is generated in realtime. This data is often unpredictable and changes rapidly in volume. To deal with these fluctuations, current systems aim to dynamically scale in and out, redistribute, and migrate computing tasks across a cluster of machines. While many prior works have focused on reducing the overhead of system reconfiguration and state migration on pre-allocated cluster resources, these approaches still face significant challenges in meeting latency SLOs at low operational costs, especially upon facing unpredictable bursty loads.

     In this paper, we propose Sponge, a new stream processing system that enables fast reactive scaling of long-running stream queries by leveraging serverless framework (SF) instances. Sponge absorbs sudden, unpredictable increases in input loads from existing VMs with low latency and cost by taking advantage of the fact that SF instances can be initiated quickly, in just a few hundred milliseconds. Sponge efficiently tracks a small number of metrics to quickly detect bursty loads and make fast scaling decisions based on these metrics. Moreover, by incorporating optimization logic at compile-time and triggering fast data redirection and partial-state merging mechanisms at runtime, Sponge avoids optimization and state migration overheads during runtime while efficiently offloading bursty loads from existing VMs to new SF instances. Our evaluation on AWS EC2 and Lambda using the NEXMark benchmark shows that Sponge promptly reacts to bursty input loads, reducing 99th-percentile tail latencies by 88% on average compared to other stream query scaling methods on VMs. Sponge also reduces cost by 83% compared to methods that over-provision VMs to handle unpredictable bursty loads.

EnvPipe: Performance-preserving DNN Training Framework for Saving Energy

Abstract: Energy saving is a crucial mission for data center providers. Among many services, DNN training and inference are significantcontributors to energy consumption. This work focuseson saving energy in multi-GPU DNN training. Typically, energysavings come at the cost of some degree of performancedegradation. However, determining the acceptable level ofperformance degradation for a long-running training job canbe difficult.
This work proposes ENVPIPE, an energy-saving DNN training framework. ENVPIPE aims to maximize energy saving while maintaining negligible performance slowdown. ENVPIPE takes advantage of slack time created by bubbles in pipeline parallelism. It schedules pipeline units to place bubbles after pipeline units as frequently as possible and then stretches the execution time of pipeline units by lowering the SM frequency. During this process, ENVPIPE does not modify hyperparameters or pipeline dependencies, preserving the original accuracy of the training task. It selectively lowers the SM frequency of pipeline units to avoid performance degradation. We implement ENVPIPE as a library using PyTorch and demonstrate that it can save up to 25.2% energy in singlenode training with 4 GPUs and 28.4% in multi-node training with 16 GPUs, while keeping performance degradation to less than 1%.

SWAN: WAN-aware Stream Processing on Geographically-distributed Clusters

Abstract: Wide-area stream analytics is commonly being used to extract operational or business insights from the data issued from multiple distant datacenters. However, timely processing of such data streams is challenging because wide-area network (WAN) bandwidth is scarce and varies widely across both different geo-locations (i.e., spatially) and points of time (i.e., temporally). Stream analytics desirable under a WAN setup requires the consideration of path diversity and the associated bandwidth from data source to sink when performing operator task placement for the query execution plan. It also has to enable fast adaptation to dynamic resource conditions, e.g., changes in network bandwidth, to keep the query execution stable.

    We present SWAN, a WAN stream analytics engine that incorporates two key techniques to meet the aforementioned requirements. First, SWAN provides a fast heuristic model that captures WAN characteristics at runtime and evenly distributes tasks to nodes while maximizing the network bandwidth for intermediate data. Second, SWAN exploits a stream relaying operator (or RO) to extend a query plan for better facilitating path diversity. This is driven by our observation that oftentimes, a longer path with more communication hops provides higher bandwidth to reach the data sink than a shorter path, allowing us to trade-off query latency for higher query throughput. SWAN stretches a given query plan by adding ROs at compile time to opportunistically place it over such a longer path. In practice, throughput gains do not necessarily lead to significant latency increases, due to higher network bandwidth providing more in-flight data transfers. Our prototype improves the latency and the throughput of stream analytics performances by 77.6% and 5.64×, respectively, compared to existing

Sibylla: To Retry or Not To Retry on Deep Learning Job Failure

Abstract: GPUs are highly contended resources in shared clusters for deep learning (DL) training. However, our analysis with a real-world trace reveals that a non-negligible number of jobs running on the cluster undergo failures and are blindly retried by the job scheduler. Unfortunately, these job failures often repeat and waste GPU resources, limiting effective GPU utilization across the cluster. In this paper, we introduce Sibylla which informs whether an observed failure of DL training will repeat or not upon retry on the failure. Sibylla employs a machine learning model based on RNNs that trains on stdout and stderr logs of failed jobs and can continuously update the model on new log messages without hand-constructing labels for the new training samples. With Sibylla, the job scheduler is learning-enhanced, performing a retry for a failed job only when it is highly likely to succeed with the retry. We evaluate the effectiveness of Sibylla under a variety of scenarios using trace-driven simulations. Sibylla improves cluster utilization and reduces job completion time (JCT) by up to 15%.

Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory

Abstract: With the ever-growing demands for GPUs, most organizations allow users to share the multi-GPU servers. However, we observe that the memory space across GPUs is not effectively utilized enough when consolidating various workloads that exhibit highly varying resource demands. This is because the current memory management techniques were designed solely for individual GPUs rather than shared multi-GPU environments.
This study introduces a novel approach to provide an illusion of virtual memory space for GPUs, called hierarchical unified virtual memory (HUVM), by incorporating the temporarily idle memory of neighbor GPUs. Since modern GPUs are connected to each other through a fast interconnect, it provides lower access latency to neighbor GPU’s memory compared to the host memory via PCIe. On top of HUVM, we design a new memory manager, called memHarvester, to effectively and efficiently harvest the temporarily available neighbor GPUs’ memory. For diverse consolidation scenarios with DNN training and graph analytics workloads, our experimental result shows up to 2.71× performance improvement compared to the prior approach in multi-GPU environments.

CarM: Hierarchical Episodic Memory for Continual Learning

Abstract: Continual Learning (CL) is an emerging machine learning paradigm in mobile or IoT devices that learns from a continuous stream of tasks. To avoid forgetting of knowledge of the previous tasks, episodic memory (EM) methods exploit a subset of the past samples while learning from new data. Despite the promising results, prior studies are mostly simulation-based and unfortunately do not promise to meet an insatiable demand for both EM capacity and system efficiency in practical system setups. We propose CarM, the first CL framework that meets the demand by a novel hierarchical EM management strategy. CarM has EM on high-speed RAMs for system efficiency and exploits the abundant storage to preserve past experiences and alleviate the forgetting by allowing CL to efficiently migrate samples between memory and storage. Extensive evaluations show that our method significantly outperforms popular CL methods while providing high training efficiency

Jarvis: Large-scale Server Monitoring with Adaptive Near-data Processing

Abstract:Rapid detection and mitigation of issues that impact performance and reliability are paramount for large-scale online services. For real-time detection of such issues, datacenter operators use a stream processor and analyze streams of monitoring data collected from servers (referred to as data source nodes) and their hosted services. The timely processing of incoming streams requires the network to transfer massive amounts of data, and significant compute resources to process it. These factors often create bottlenecks for stream analytics. To help overcome these bottlenecks, current monitoring systems employ near-data processing by either computing an optimal query partition based on a cost model or using model-agnostic heuristics. Optimal partitioning is computationally expensive, while model-agnostic heuristics are iterative and search over a large solution space. We combine these approaches by using model-agnostic heuristics to improve the partitioning solution from a model-based heuristic. Moreover, current systems use operator-level partitioning: if a data source does not have sufficient resources to execute an operator on all records, the operator is executed only on the stream processor. Instead, we perform data-level partitioning—i.e., we allow an operator to be executed both on a stream processor and data sources. We implement our algorithm in a system called Jarvis, which enables quick adaptation to dynamic resource conditions. Our evaluation on a diverse set of monitoring workloads suggests that Jarvis converges to a stable query partition within seconds of a change in node resource conditions. Compared to current partitioning strategies, Jarvis handles up to 75% more data sources while improving throughput in resource-constrained scenarios by 1.2-4.4×.

Streaming Analytics with Adaptive Near-data Processing

Abstract: Streaming analytics applications need to process massive volumes of data in a timely manner, in domains ranging from datacenter telemetry and geo-distributed log analytics to Internet-of-Things systems. Such applications suffer from significant network transfer costs to transport the data to a stream processor and compute costs to analyze the data in a timely manner. Pushing the computation closer to the data source by partitioning the analytics query is an effective strategy to reduce resource costs for the stream processor. However, the partitioning strategy depends on the nature of resource bottleneck and resource variability that is encountered at the compute resources near the data source. In this paper, we investigate different issues which affect query partitioning strategies. We first study new partitioning techniques within cloud datacenters which operate under constrained compute conditions varying widely across data sources and different time slots. With insights obtained from the study, we suggest several different ways to improve the performance of stream analytics applications operating in different resource environments, by making effective partitioning decisions for a variety of use cases such as geo-distributed streaming analytics.

Zico: Efficient GPU Memory Sharing for Concurrent DNN Training

Abstract: GPUs are the workhorse in modern server infrastructure fueling advances in a number of compute-intensive workloads such as deep neural network (DNN) training. Several recent works propose solutions on sharing GPU resources across multiple concurrent DNN training jobs, but none of them address rapidly increasing memory footprint introduced by such job co-locations, which greatly limit the effectiveness of sharing GPU resources. In this paper, we present Zico, the first DNN system that aims at reducing the system-wide memory consumption for concurrent training. Zico keeps track of the memory usage pattern of individual training job by monitoring its progress on GPU computations and makes memory reclaimed from the job globally sharable. Based on this memory management scheme, Zico automatically decides a strategy to share memory among concurrent jobs with minimum delay on training while not exceeding a given memory budget such as GPU memory capacity. Our evaluation shows that Zico outperforms existing GPU sharing approaches and delivers benefits over a variety of job co-location scenarios.

Reliability of Large-scale GPU Clusters for Deep Learning Workloads

Abstract: Recent advances on deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues while focusing on training job failures from analyzing logs collected from running deep learning workloads on a largescale GPU cluster in production. These failures are largely grouped into two categories, infrastructure and user, based on their sources, and reveal diverse reasons causing the failures. With insights obtained from the failure analysis, we suggest several different ways to improve the stability of shared GPU clusters designed for DL training and optimize user experience by reducing failure occurrences.

Approximate Quantiles for Datacenter Telemetry Monitoring

Abstract: Datacenter systems require real-time troubleshooting so as to minimize downtimes. In doing so, datacenter operators employ streaming analytics for collecting and processing datacenter telemetry over a temporal window. Quantile computation is key to this telemetry monitoring since it can summarize the typical and abnormal behavior of the monitored system. However, computing quantiles in real-time is resource-intensive as it requires processing hundreds of millions of events in seconds while providing high accuracy. To address these challenges, we propose AOMG, an efficient and accurate quantile approximation algorithm that capitalizes insights from our workload study. AOMG improves performance through two-level hierarchical windowing while offering small value errors in a wide range of quantiles by taking into account the density of underlying data distribution. Our evaluations show that AOMG estimates the exact quantiles with less than 5% relative value error for a variety of use cases while providing high throughput.

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Abstract: With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month long trace from a multi-tenant GPU cluster in Microsoft. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.

StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory

Abstract: Stream analytics has an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacitylimited and (2) HBM boosts performance best for sequential access and high parallelismworkloads. At first glance, stream analytics appears a particularly poor match for HBM because they have high capacity demands and data grouping operations, their most demanding computations, use random access.

    This paper presents the design and implementation of StreamBox-HBM, a stream analytics engine that exploits hybrid memories to achieve scalable high performance. StreamBox-HBM performs data grouping with sequential access sorting algorithms in HBM, in contrast to random access hashing algorithms commonly used in DRAM. StreamBox- HBM solely uses HBM to store Key Pointer Array (KPA) data structures that contain only partial records (keys and pointers to full records) for grouping operations. It dynamically creates and manages prodigious data and pipeline parallelism, choosing when to allocate KPAs in HBM. It dynamically optimizes for both the high bandwidth and limited capacity of HBM, and the limited bandwidth and high capacity of standard DRAM.

Tiresias: A GPU Cluster Manager for Distributed Deep Learning

Abstract: Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers cause long queueing delays and low overall performance. We present Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCTs). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms – Discretized Two- Dimensional Gittins index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic – that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed, and present a placement algorithm to leverage these observations without any user input. Experiments on the Michigan ConFlux cluster with 60 P100 GPUs and largescale trace-driven simulations show that Tiresias improves the average JCT by up to 5:5 over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.

Accelerated Training for CNN Distributed Deep Learning through Automatic Resource-Aware Layer Placement

  • Jay H. Park, Sunghwan Kim, Jinwon Lee, Myeongjae Jeon, Sam H. Noh
  • PreprintJan. 2019 [PDF]

Abstract: The Convolutional Neural Network (CNN) model, often used for image classification, requires significant training time to obtain high accuracy. To this end, distributed training is performed with the parameter server (PS) architecture using multiple servers. Unfortunately, scalability has been found to be poor in existing architectures. We find that the PS network is the bottleneck as it communicates a large number of gradients and parameters with the many workers. This is because synchronization with the many workers has to occur at every step of training. Depending on the model, communication can be in the several hundred MBs per synchronization. In this paper, we propose a scheme to reduce network traffic through layer placement that considers the resources that each layer uses. Through analysis of the characteristics of CNN, we find that placement of layers can be done in an effective manner. We then incorporate this observation within the TensorFlow framework such that layers can be automatically placed for more efficient training. Our evaluation making use of this placement scheme show that training time can be significantly reduced without loss of accuracy for many CNN models.

TerseCades: Efficient Data Compression in Stream Processing

  • Gennady Pekhimenko, Chuanxiong Guo, Myeongjae Jeon, Peng Huang, Lidong Zhou
  • USENIX ATC, Jul. 2018

StreamBox: Modern Stream Processing on a Multicore Machine

  • Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, Felix Xiaozhu Lin
  • USENIX ATC, Jul. 2017

SSD Failures in Datacenters: What, When and Why?

  • Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid
  • ACM SIGMETRICS / IFIP Performance, Jun. 2016 [PDF] (Poster)
  • ACM SYSTOR, Jun. 2016 (Best Student Paper) [PDF]

TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services

  • Myeongjae Jeon, Yuxiong He, Hwanju Kim, Sameh Elnikety, Scott Rixner, Alan L. Cox
  • ASPLOS, Apr. 2016

Predictive Parallelization: Taming Tail Latencies in Web Search

Reducing DRAM Row Activations with Eager Read/Write Clustering

Adaptive Parallelism for Web Search

  • Myeongjae Jeon, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner
  • ACM EuroSys, Apr. 2013 [PDF][Slides(pptx)]

Workload Characterization and Performance Implications of Large-Scale Blog Servers

Energy Reduction in Consolidated Servers Through Memory-Aware Virtual Machine Scheduling

Replicated Abstract Data Types: Building Blocks for Collaborative Applications

Measurement, Modeling, and Analysis of a Large-scale Blog Server Workload

Log' version vector: Logging version vectors concisely in dynamic replication

Guest-Aware Priority-based Virtual Machine Scheduling for Highly Consolidated Server

Domain Level Page Sharing in Xen Virtual Machine Systems