Techniques for Shared Resource Management in Systems with Throughput Processors

Ausavarungnirun, Rachata

doi:10.1184/R1/6723512.v1

Techniques for Shared Resource Management in Systems with Through.pdf (4.85 MB)

Techniques for Shared Resource Management in Systems with Throughput Processors

thesis

posted on 2017-05-01, 00:00 authored by Rachata Ausavarungnirun

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data parallel (GPGPU) applications. However, this success has been accompa- nied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. This dissertation identifies and eliminates performance bottlenecks caused by major sources of interference throughout the memory hierarchy. Specifically, we provide an in-depth analysis of inter- and intra-application as well as inter- address-space interference that significantly degrade the performance and efficiency of GPU-based systems. To minimize such interference, we introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications’ charac- teristics. We introduce mechanisms to dynamically analyze different applications’ characteristics and propose four major changes throughout the memory hierarchy. First, we introduce Memory Divergence Correction (MeDiC), a cache management mecha- nism that mitigates intra-application interference in GPGPU applications by allowing the shared L2 cache and the memory controller to be aware of the GPU’s warp-level memory divergence characteristics. MeDiC uses this warp-level memory divergence information to give more cache space and more memory bandwidth to warps that benefit most from utilizing such resources. Our evaluations show that MeDiC significantly outperforms multiple state-of-the-art caching policies proposed for GPUs. Second, we introduce the Staged Memory Scheduler (SMS), an application-aware CPU-GPU memory request scheduler that mitigates inter-application interference in heterogeneous CPU-GPU systems. SMS creates a fundamentally new approach to memory controller design that decouples the memory controller into three significantly simpler structures, each of which has a separate task, These structures operate together to greatly improve both system performance and fairness. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus on inter-application scheduling decisions. These two stages en- force high-level policies regarding performance and fairness. As a result, the last stage is simple logic that deals only with the low-level DRAM commands and timing. SMS is also configurable: it allows the system software to trade off between the quality of service provided to the CPU versus GPU applications. Our evaluations show that SMS not only reduces inter-application interference caused by the GPU, thereby improving heterogeneous system performance, but also provides better scalability and power efficiency compared to multiple state-of-the-art memory schedulers. Third, we redesign the GPU memory management unit to efficiently handle new problems caused by the massive address translation parallelism present in GPU computation units in multi- GPU-application environments. Running multiple GPGPU applications concurrently induces significant inter-core thrashing on the shared address translation/protection units; e.g., the shared Translation Lookaside Buffer (TLB), a new phenomenon that we call inter-address-space interference. To reduce this interference, we introduce Multi Address Space Concurrent Kernels (MASK). MASK introduces TLB-awareness throughout the GPU memory hierarchy and introduces TLBand cache-bypassing techniques to increase the effectiveness of a shared TLB. Finally, we introduce Mosaic, a hardware-software cooperative technique that further increases the effectiveness of TLB by modifying the memory allocation policy in the system software. Mosaic introduces a high-throughput method to support large pages in multi-GPU-application environments. The key idea is to ensure memory allocation preserve address space contiguity to allow pages to be coalesced without any data movements. Our evaluations show that the MASK-Mosaic combination provides a simple mechanism that eliminates the performance overhead of address translation in GPUs without significant changes to GPU hardware, thereby greatly improving GPU system performance. The key conclusion of this dissertation is that a combination of GPU-aware cache and memory management techniques can effectively mitigate the memory interference on current and future GPU-based systems as well as other types of throughput processors.

History

Date

2017-05-01

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Onur Mutlu

Usage metrics

Keywords

GPU Resource Management Throughput Processors

Licence

In Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Techniques for Shared Resource Management in Systems with Throughput Processors

History

Date

Degree Type

Department

Degree Name

Advisor(s)

Usage metrics

Categories

Keywords

Licence

Exports