All Rights Reserved
Abstract or Description
Many programs initialize or copy large amounts of data in memory. Initialization and copying are memory operations that require no computation to derive their data values: they either write known values (e.g., initialize to zero) or move values that already exist elsewhere in memory (e.g., copy). Therefore, initialization and copying can potentially be performed entirely within the main memory subsystem, without involving the processor or the DMA engines. Unfortunately, existing main memory subsystems are unable to take advantage of this fact. Instead, they unnecessarily transfer large amounts of data between main memory and the processor (or the DMA engine), incurring high latency and consuming large amounts of bandwidth and energy.
In this paper, we make the key observation that DRAM chips, the predominant substrate for main memory, already have the capability to transfer large amounts of data within themselves. Internally, a DRAM chip consists of rows of bits and a row-buffer. To access data from any portion of a particular row, the DRAM chip transfers the entire row (e.g., 4 Kbits) into the equally-sized row-buffer, and vice versa. While this internal data transfer between a row and the row-buffer occurs in bulk (i.e., all 4 Kbits at once), an external data transfer (to or from the processor) is severely serialized by the very narrow width of the DRAM chip's data pins (e.g., 8 bits). Our key idea is to utilize and extend the row-granularity data transfer in order to quickly initialize or move data one row at a time within a DRAM chip. We call this new mechanism RowClone. By making several relatively unintrusive changes to DRAM chip design (0.026% die-size increase), we accelerate a one-page (4 KByte) copying operation by 11.5x and a one-page zeroing operation by 5.8x, while also conserving memory bandwidth. In addition, we achieve large energy reductions: 41.5x and 74.4x for one-page zeroing and copying, respectively. We show that RowClone improves performance on an 8-core system by 27%, averaged across 64 copy/initialization-intensive workloads.
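The contrast between bulk internal transfers and serialized external transfers can be illustrated with a toy model. The Python sketch below is not the paper's mechanism or simulator. `ToyDramChip` and its methods are hypothetical names introduced here, and the model only counts operations, using the example sizes quoted above (a 4 Kbit row and 8-bit data pins). It says nothing about real timing or energy.

```python
ROW_BITS = 4096      # row size from the abstract's example (4 Kbits)
PIN_WIDTH_BITS = 8   # data-pin width from the abstract's example


class ToyDramChip:
    """Hypothetical toy model: rows of bits plus a single row-buffer."""

    def __init__(self, num_rows):
        self.rows = [[0] * ROW_BITS for _ in range(num_rows)]
        self.row_buffer = [0] * ROW_BITS
        self.internal_transfers = 0  # whole-row moves inside the chip
        self.pin_transfers = 0       # 8-bit moves over the data pins

    def row_to_buffer(self, r):
        # Bulk transfer: the entire row moves into the row-buffer at once.
        self.row_buffer = list(self.rows[r])
        self.internal_transfers += 1

    def buffer_to_row(self, r):
        # Bulk transfer in the opposite direction.
        self.rows[r] = list(self.row_buffer)
        self.internal_transfers += 1

    def copy_via_pins(self, src, dst):
        """Baseline: stream the row out over the pins, then back in."""
        self.row_to_buffer(src)
        staged = []  # stands in for the processor or DMA engine
        for i in range(0, ROW_BITS, PIN_WIDTH_BITS):
            staged.append(self.row_buffer[i:i + PIN_WIDTH_BITS])
            self.pin_transfers += 1
        self.row_to_buffer(dst)
        for k, chunk in enumerate(staged):
            start = k * PIN_WIDTH_BITS
            self.row_buffer[start:start + PIN_WIDTH_BITS] = chunk
            self.pin_transfers += 1
        self.buffer_to_row(dst)

    def copy_in_dram(self, src, dst):
        """Row-granularity copy: source row -> row-buffer -> destination row."""
        self.row_to_buffer(src)
        self.buffer_to_row(dst)
```

In this model, copying one row through the pins takes 4096 / 8 = 512 pin transfers in each direction (1024 in total), whereas the in-DRAM copy uses two bulk transfers and no pin transfers. These are operation counts in a simplified model only. The speedup and energy figures reported above come from the paper's evaluation and cannot be derived from this sketch.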