Carnegie Mellon University

Automated Diagnosis of Chronic Performance Problems in Production Systems

thesis
posted on 2013-05-01, 00:00 authored by Soila Kavulya

Large production systems are susceptible to chronic performance problems where the system still works, but with degraded performance. Chronic performance problems occur intermittently or affect a subset of end-users. Traditional approaches for diagnosis typically rely on a bottom-up approach that localizes problems by correlating low-level alarms (such as resource utilization indicators or network packet loss) across components in a production system. However, these alarm-correlation approaches fall short when diagnosing chronics because they fail to provide the necessary application-level visibility to detect chronics effectively. Due to the scale and complexity of production systems, there can be multiple unresolved chronics at any given time; their symptoms often overlap with each other, and they are sometimes triggered by complex corner cases.

This dissertation presents a top-down diagnostic framework for diagnosing chronic performance problems in production systems. The framework comprises four components. First, an extensible log-analysis framework extracts end-to-end causal flows using common white-box (i.e., application) logs in the production system; these end-to-end flows capture the user’s experience with the system. Second, anomaly-detection tools exploit heuristics and a peer-comparison approach to label each end-to-end flow as successful or failed. Third, a top-down statistical diagnostic tool combines white-box metrics with black-box metrics (e.g., CPU usage) to localize the source of the problem by identifying attributes that are more correlated with failed flows than successful ones. Fourth, a visualization tool uses peer-comparison to highlight anomalous nodes in a parallel-computing cluster.
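
As a rough illustration of the third component, the sketch below ranks attributes by how much more often they appear in failed flows than in successful ones. It is not the dissertation's statistical method: the scoring rule (a simple difference of occurrence rates), the function name rank_attributes, the flow representation, and the sample attribute strings are all assumptions made for this example.

```python
from collections import Counter


def rank_attributes(flows):
    """Rank attributes by (rate in failed flows) - (rate in successful flows).

    `flows` is a list of (attributes, failed) pairs, where `attributes` is a
    set of strings and `failed` is a bool. This representation is hypothetical.
    """
    failed = [attrs for attrs, is_failed in flows if is_failed]
    succeeded = [attrs for attrs, is_failed in flows if not is_failed]
    if not failed or not succeeded:
        # Without both classes there is nothing to compare against.
        return []

    failed_counts = Counter(a for attrs in failed for a in attrs)
    ok_counts = Counter(a for attrs in succeeded for a in attrs)

    scores = {}
    for attr in set(failed_counts) | set(ok_counts):
        rate_failed = failed_counts[attr] / len(failed)
        rate_ok = ok_counts[attr] / len(succeeded)
        scores[attr] = rate_failed - rate_ok

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up flows mixing white-box (node) and black-box (CPU) attributes.
SAMPLE_FLOWS = [
    ({"node=A", "cpu=high"}, True),
    ({"node=A", "cpu=low"}, True),
    ({"node=B", "cpu=low"}, False),
    ({"node=C", "cpu=low"}, False),
    ({"node=B", "cpu=high"}, False),
]

if __name__ == "__main__":
    for attr, score in rank_attributes(SAMPLE_FLOWS):
        print(f"{attr}: {score:+.2f}")
```

In this made-up data, node=A appears in every failed flow and in no successful one, so it ranks first; a real diagnosis would need a statistical test that accounts for sample size and overlapping problems.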

The diagnostic framework has been used to localize real incidents at an academic cloud-computing cluster that runs the Hadoop parallel-processing framework, and at a production Voice-over-IP system at a major Internet Services Provider. Our approach is not limited to these two systems and is applicable to systems such as Internet Services that serve users via independent interactions.

History

Date

2013-05-01

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Priya Narasimhan
