Abstract or Table of Contents
Localizing performance problems (or fingerpointing) is essential for distributed systems such as Hadoop that support long-running, parallelized, data-intensive computations over a large cluster of nodes. Manual fingerpointing does not scale in such environments because of the number of nodes and the number of performance metrics to be analyzed on each node. ASDF is an automated, online fingerpointing framework that transparently extracts and parses different time-varying data sources (e.g., sysstat, Hadoop logs) on each node, and implements multiple techniques (e.g., log analysis, correlation, clustering) to analyze these data sources jointly or in isolation. We demonstrate ASDF’s online fingerpointing for documented performance problems in Hadoop, under different workloads; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time, and exhibits average online fingerpointing latencies of less than 1 minute with false-positive rates of less than 1%.
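To make the idea of automated fingerpointing concrete, the following is a minimal sketch of one simple way to localize a misbehaving node: compare each node's metric against the median of its peers and flag strong outliers. This is not ASDF's actual algorithm, which the abstract describes only at a high level (log analysis, correlation, clustering). The function name `fingerpoint`, the `threshold` parameter, the node names, and the sample CPU values are all hypothetical and chosen purely for illustration.

```python
from statistics import median


def fingerpoint(samples, threshold=3.0):
    """Return the nodes whose metric deviates strongly from the peer median.

    samples:   dict mapping node name -> metric value for one time window
               (for example, CPU utilization as reported by sysstat).
    threshold: hypothetical cutoff, in units of median absolute deviation.
    """
    values = list(samples.values())
    center = median(values)
    # Median absolute deviation: a robust estimate of spread across peers.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half of the peers sit exactly on the median;
        # flag any node that differs from it at all.
        return sorted(n for n, v in samples.items() if v != center)
    return sorted(
        n for n, v in samples.items() if abs(v - center) / mad > threshold
    )


# Illustrative (made-up) CPU utilization percentages for five nodes.
window = {
    "node1": 41.0,
    "node2": 39.5,
    "node3": 40.2,
    "node4": 93.0,
    "node5": 40.8,
}
print(fingerpoint(window))  # prints ['node4']
```

The median and median absolute deviation are used here instead of the mean and standard deviation so that a single faulty node does not distort the baseline it is compared against. This assumes that most nodes in the cluster behave alike under the same workload.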