Date of Original Version



Conference Proceeding

Rights Management

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract or Description

Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approachesto providing reliability for memory devices pessimistically treat all data as equally vulnerable to memoryerrors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. For example, we found that while traditional error protection increasesmemory system cost by 12.5%, some applications can achieve 99.00% availability on a single server with a large number of memory errors without any error protection. This presents an opportunity togreatly reduce server hardware cost by provisioning the right amount of memory reliability for differentapplications. Toward this end, in this paper, we make three main contributions to enable highly-reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance ofapplications to memory errors. Second, using our methodology, we perform a case study of three new dataintensive workloads (an interactive web search application, an in-memory key -- value store, and a graph mining framework) to identify new insights into the nature of application memory errorvulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability and discuss their trade-off. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single server availability.





Published In

Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014, 467-478.