Talk title: Detecting Large-Scale System Problems by Mining Console Logs
Speaker: 简海燕
Time: July 13, 2018, 11:30 AM
Venue: Room 603, Boxue Building, North Campus, Guizhou University
Abstract:
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, because they often consist of a voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We use a combination of program analysis and information retrieval techniques to transform free-text console logs into numerical features, which capture sequences of events in the system. We then analyze these features using machine learning to detect operational problems. We also show how to distill the results of our analysis into an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. In addition, we extend our methods to online problem detection, where the sequences of events are continuously generated as data streams.
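The feature-extraction step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the message templates, the sample log lines, and the helper names `count_vectors` and `unusual` are all hypothetical, and a simple "differs from the most common vector" check stands in for the machine learning step, which the abstract does not specify.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates; the abstract says the real ones
# come from program analysis of the source code.
TEMPLATES = {
    "open": re.compile(r"Opening block (\S+)"),
    "write": re.compile(r"Writing block (\S+)"),
    "close": re.compile(r"Closing block (\S+)"),
}
EVENTS = list(TEMPLATES)  # fixed column order for the feature vectors


def count_vectors(lines):
    """Group log lines by identifier and count each event type."""
    counts = defaultdict(Counter)
    for line in lines:
        for event, pattern in TEMPLATES.items():
            match = pattern.search(line)
            if match:
                counts[match.group(1)][event] += 1
                break
    return {ident: [c[e] for e in EVENTS] for ident, c in counts.items()}


def unusual(vectors):
    """Flag identifiers whose vector differs from the most common one.

    This is a stand-in for the learning step, chosen only for brevity.
    """
    common, _ = Counter(tuple(v) for v in vectors.values()).most_common(1)[0]
    return sorted(i for i, v in vectors.items() if tuple(v) != common)


# Made-up log lines: block b3 is never closed.
LOG = [
    "Opening block b1", "Writing block b1", "Closing block b1",
    "Opening block b2", "Writing block b2", "Closing block b2",
    "Opening block b3", "Writing block b3",
]
vectors = count_vectors(LOG)
print(vectors)
print(unusual(vectors))
```

Each identifier becomes one numerical row, so intermixed free-text messages from different components turn into data that a standard detector can consume.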
Talk title: Revisiting Performance Interference among Consolidated n-Tier Applications: Sharing is Better than Isolation
Speaker: 陈锦秋
Time: July 13, 2018, 2:00 PM
Venue: Room 603, Boxue Building, North Campus, Guizhou University
Abstract:
Performance unpredictability is one of the major concerns slowing down the migration of mission-critical applications into cloud computing infrastructures. One non-intuitive example is the measured performance of an n-tier application in a virtualized environment, where an increasing workload caused a competing, co-located constant workload to decrease its response time. In this paper, we investigate the sensitivity of measured performance in relation to two factors: (1) the consolidated server's specification of virtual machine resource availability, and (2) the burstiness of the n-tier application workload. Our first and surprising finding is that specifying complete isolation, e.g., a 50-50 even split of CPU between two co-located virtual machines (VMs), results in significantly lower performance compared to a fully shared allocation, e.g., up to 100% CPU for both co-located VMs. This happens even at relatively modest resource utilization levels (e.g., 40% CPU in the VMs). Second, we found that an increasingly bursty workload also increases the performance loss among the consolidated servers, even at similarly modest utilization levels (e.g., 70% overall). A potential solution to the first problem (performance loss due to resource allocation) is cross-tier-priority scheduling (giving higher priority to shorter jobs), which can reduce the performance loss by a factor of two in our experiments. In contrast, bursty workloads are a more difficult problem: our measurements show they affect both the isolation and sharing strategies in virtual machine resource allocation.
Talk title: Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms
Speaker: 王可
Time: July 13, 2018, 2:30 PM
Venue: Room 603, Boxue Building, North Campus, Guizhou University
Abstract:
Cloud research to date has lacked data on the characteristics of the production virtual machine (VM) workloads of large cloud providers. A thorough understanding of these characteristics can inform the providers' resource management systems, e.g., VM scheduler, power manager, server health manager. In this paper, we first introduce an extensive characterization of Microsoft Azure's VM workload, including distributions of the VMs' lifetime, deployment size, and resource consumption. We then show that certain VM behaviors are fairly consistent over multiple lifetimes, i.e., history is an accurate predictor of future behavior. Based on this observation, we next introduce Resource Central (RC), a system that collects VM telemetry, learns these behaviors offline, and provides predictions online to various resource managers via a general client-side library. As an example of RC's online use, we modify Azure's VM scheduler to leverage predictions in oversubscribing servers (with oversubscribable VM types), while retaining high VM performance. Using real VM traces, we then show that the prediction-informed schedules increase utilization and prevent physical resource exhaustion. We conclude that providers can exploit their workloads' characteristics and machine learning to improve resource management substantially.
Talk title: When Average is Not Average: Large Response Time Fluctuations in n-Tier Systems
Speaker: 许振雪
Time: July 13, 2018, 3:00 PM
Venue: Room 603, Boxue Building, North Campus, Guizhou University
Abstract:
Simultaneously achieving good performance and high resource utilization is an important goal for production cloud environments. Through extensive measurements of an n-tier application benchmark (RUBBoS), we show that system response time frequently presents large-scale fluctuations (e.g., ranging from tens of milliseconds up to tens of seconds) during periods of high resource utilization. Besides bursty workloads from clients, we found that the large-scale response time fluctuations can be caused by some system environmental conditions (e.g., L2 cache misses, JVM garbage collection, inefficient scheduling policies) that commonly exist in n-tier applications. The impact of these system environmental conditions can largely amplify the end-to-end response time fluctuations because of the complex resource dependencies in the system. For instance, a 50ms response time increase in the database tier can be amplified to a 500ms end-to-end response time increase. We evaluate three heuristics to stabilize response time fluctuations while still achieving high resource utilization in the system. Our results show that large-scale response time fluctuations should be taken into account when designing effective autonomous self-scaling n-tier systems in the cloud.
Talk title: DCM: Dynamic Concurrency Management for Scaling n-Tier Applications in Cloud
Speaker: 李雨杰
Time: July 13, 2018, 3:30 PM
Venue: Room 603, Boxue Building, North Campus, Guizhou University
Abstract:
Scaling web applications such as e-commerce in the cloud by adding or removing servers in the system is an important practice to handle workload variations, with the goal of achieving both high quality of service (QoS) and high resource efficiency. Through extensive scaling experiments of an n-tier application benchmark (RUBBoS), we have observed that scaling only hardware resources without appropriate adaptation of soft resource allocations (e.g., thread or connection pool size) of each server would cause significant performance degradation of the overall system by either under- or over-utilizing the bottleneck resource in the system. We develop a dynamic concurrency management (DCM) framework which integrates soft resource allocations into the system scaling management. DCM introduces a model which determines a near-optimal concurrency setting for each tier of the system based on a combination of operational queuing laws and online analysis of fine-grained measurement data. We implement DCM as a two-level actuator which scales both hardware and soft resources in an n-tier system on the fly without interrupting the runtime system performance. Our experimental results demonstrate that DCM can achieve significantly more stable performance and higher resource efficiency compared to the state-of-the-art hardware-only scaling solutions (e.g., Amazon EC2-AutoScale) under realistic bursty workload traces.
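As a rough illustration of how an operational queuing law can relate measurements to a concurrency setting, the sketch below applies Little's law (N = X × R) to one tier. The abstract does not describe DCM's actual model, so this is only an assumed example: the function names, the input numbers, and the rule of rounding the estimate up to a pool size are hypothetical.

```python
import math


def littles_law_concurrency(throughput_rps, response_time_ms):
    """Average number of requests inside a tier, N = X * R (Little's law).

    throughput_rps: completed requests per second (X).
    response_time_ms: mean time a request spends in the tier, in ms (R).
    """
    return throughput_rps * response_time_ms / 1000


def suggest_pool_size(throughput_rps, response_time_ms):
    """Hypothetical rule: round the estimated concurrency up to an integer."""
    return math.ceil(littles_law_concurrency(throughput_rps, response_time_ms))


# Made-up measurements for one tier: 400 requests/s at 50 ms each.
print(littles_law_concurrency(400, 50))
print(suggest_pool_size(400, 50))
```

A pool much smaller than this estimate would leave the bottleneck resource idle, while a much larger one would admit more concurrent requests than the tier can serve efficiently, which matches the under- and over-utilization cases named in the abstract.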
Talk title: Rapidly Alternating Bottlenecks: A Study of Two Cases in n-Tier Applications
Speaker: 陈帝
Time: July 13, 2018, 4:00 PM
Venue: Room 603, Boxue Building, North Campus, Guizhou University
Abstract:
Identifying the location of performance bottlenecks is a non-trivial challenge when scaling n-tier applications in cloud computing. Specifically, we observed that an n-tier application may experience significant performance loss when bottlenecks alternate rapidly between component servers. Such rapidly alternating bottlenecks arise naturally and often from resource dependencies in an n-tier system and bursty workloads. These rapidly alternating bottlenecks are difficult to detect because the saturation in each participating server may have a very short lifespan (e.g., milliseconds) compared to current system monitoring tools and methods with sampling at intervals of seconds. Using passive network tracing at fine granularity (e.g., aggregating every 50ms), we are able to correlate throughput (i.e., request service rate) and queue length (i.e., number of concurrent requests) in each server of an n-tier system. Our experimental results show conclusive evidence of rapidly alternating bottlenecks caused by system software (JVM garbage collection) and middleware (VM collocation).
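The fine-grained aggregation mentioned in this abstract can be sketched as follows. This is an assumed, simplified illustration rather than the authors' tooling: it takes per-request arrival and departure timestamps for one server (such as passive network tracing could yield), and the function names, the sample trace, and the queue threshold are hypothetical.

```python
def window_metrics(requests, window_ms=50):
    """Aggregate one server's requests into fixed-size time windows.

    requests: list of (arrival_ms, departure_ms) pairs.
    Returns one (completed, in_flight) pair per window, where
    `completed` is the throughput of that window and `in_flight` is the
    queue length (concurrent requests) at the end of the window.
    """
    horizon = max(dep for _, dep in requests)
    n_windows = horizon // window_ms + 1
    metrics = []
    for w in range(n_windows):
        start, end = w * window_ms, (w + 1) * window_ms
        completed = sum(1 for _, dep in requests if start <= dep < end)
        in_flight = sum(1 for arr, dep in requests if arr < end <= dep)
        metrics.append((completed, in_flight))
    return metrics


def saturated_windows(metrics, queue_threshold):
    """Hypothetical rule: a window is suspect if its queue is long."""
    return [i for i, (_, q) in enumerate(metrics) if q >= queue_threshold]


# Made-up trace: three requests stall for about 100 ms, then drain.
TRACE = [(0, 40), (10, 120), (20, 130), (30, 140), (160, 180)]
metrics = window_metrics(TRACE)
print(metrics)
print(saturated_windows(metrics, queue_threshold=3))
```

With one-second sampling, the whole trace would fall into a single interval and the short stall would be averaged away, which is why the window size matters for bottlenecks that last only milliseconds.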