Exploring the Limits of GPGPU Scheduling in Control Flow Bound Applications

Roman Malits, Avi Mendelson, Avinoam Kolodny, Evgeny Bolotin



{romanma@tx, kolodny@ee}.technion.ac.il, EE Department, Technion-IIT, Israel

evgeny.bolotin@intel.com, Intel Corporation, Haifa, Israel

avim@microsoft.com, Microsoft R&D Israel

Abstract

GPGPUs are optimized for graphics; as a result, the hardware is tuned for massively data-parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication, a GPGPU-based system can achieve very high performance. However, many general-purpose data-parallel applications are characterized by intensive control flow and unpredictable memory access patterns.

Optimizing the code of such problems for current hardware is often ineffective and even impractical, since the resulting code exhibits low hardware utilization, leading to relatively low performance.

This work tracks the root causes of execution inefficiencies when running control-flow-intensive CUDA applications on NVIDIA GPGPU hardware.

We show, both analytically and by simulations of various benchmarks, that local thread scheduling has inherent limitations when dealing with applications that exhibit a high rate of branch divergence. To overcome these limitations we propose hierarchical warp scheduling and global warp reconstruction. We implement an ideal hierarchical warp scheduling mechanism, termed ODGS (Oracle Dynamic Global Scheduling), designed to maximize machine utilization via global warp reconstruction. For control flow bound applications that make no use of shared memory, we show that (1) there is still substantial potential for performance improvement, and (2) based on various synthetic and real benchmarks, we demonstrate the feasible performance improvement.

For example, MUM and BFS are parallel graph algorithms that suffer from significant branch divergence. We show that in these algorithms it is possible to achieve performance gains of up to 4.4x and 2.6x, respectively, relative to previously applied scheduling methods.



Keywords: multicore processing; performance analysis; parallel machines; scheduling algorithm

INTRODUCTION

GPGPU computing extends the use of GPUs to general-purpose applications. Accordingly, graphics hardware has been extended, and new programming models such as CUDA [3] and OpenCL [19] have been introduced, enabling the use of GPUs for a wide variety of applications [17].

Current GPGPU programming languages, such as CUDA, expose GPU micro-architectural internals. This allows programmers to perform fine-grained, code-level optimizations by explicitly specifying and hand-tuning the available application thread-level parallelism to the specific hardware.
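As a minimal illustration of this explicit mapping, consider the following CUDA sketch; the kernel, problem size, and block size below are illustrative choices, not taken from the paper:

    // Minimal CUDA example: the programmer explicitly chooses the grid
    // and block dimensions, hand-tuning the mapping of threads to the
    // hardware. All names and sizes here are illustrative.
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Block size is typically hand-tuned to the hardware, e.g., a
        // multiple of the 32-thread warp size that balances occupancy
        // against per-thread register and shared memory usage.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }

Exposing the grid and block dimensions in this way is precisely what enables, and also what burdens, per-application hand-tuning.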

Such optimizations achieve high performance for the class of applications that can be expressed as massively parallel with a very simple control structure and few dependencies between threads, since such applications naturally fit the current GPGPU design philosophy. However, for many other applications, e.g., those that have intensive control flow and unpredictable memory access patterns, optimizing the code for current GPGPU hardware is often impractical, resulting in a severe drop in system utilization and lowered performance [1][2][4][6][7].

In this paper we explore the limitations and pin down the root causes of execution inefficiencies in several classes of CUDA applications. GPGPU architectures assume that the software is highly parallel and can take advantage of two levels of parallelism: (1) TLP (task-level parallelism), which assumes that each phase of the execution can be divided into independent parallel parts, and (2) inner-thread parallelism, which exploits the data parallelism within each task. While TLP is mainly used to mitigate stalls caused by long latencies, usually related to memory access time, inner-thread parallelism is supported via warp scheduling and SIMD-like hardware. To achieve the best performance, one must carefully balance these two levels of parallelism. This task can be quite challenging, in particular when threads within a task use complex control flow, as in [11][18][21][22][23]. As indicated in [1][2][7], the scheduling mechanisms in GPU hardware such as NVIDIA's G200, or even Fermi, incur several inefficiencies associated with branch divergence and with scheduling at the granularity of blocks.
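To make the branch divergence problem concrete, the following illustrative kernel (not from the paper) contains a data-dependent branch. On SIMD-like warp hardware the two paths are executed one after the other with the inactive threads masked off, so a 32-thread warp whose threads split between the paths runs at a fraction of peak throughput:

    __global__ void divergent(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;
        // Data-dependent branch: threads of the same warp may take
        // different paths, which the SIMD hardware serializes.
        if (in[i] % 2 == 0)
            out[i] = in[i] * 3;   // executes with odd-element lanes masked off
        else
            out[i] = in[i] + 7;   // executes with even-element lanes masked off
    }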

Current hardware is built out of several streaming multiprocessors (SMs) and employs scheduling optimizations local to each SM. This approach is effective when control flow is simple, but, as we show later, the use of local optimizations has inherent limitations when dealing with applications that have a high rate of branch divergence.

In this work we analyze these scheduling-related bottlenecks analytically and through simulations, using a modified version of the GPGPU-SIM [2] cycle-accurate simulator. To optimize the performance of control flow bound applications, we propose a novel approach based on hierarchical warp scheduling and global warp reconstruction. We implemented (in GPGPU-SIM) an ideal hierarchical scheduling mechanism, termed ODGS (Oracle Dynamic Global Scheduling), designed to maximize the amount of effective parallelism in the system through the use of global warp reconstruction.
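The internals of ODGS are described later in the paper; the following host-side C++ sketch only conveys the core idea of global warp reconstruction, under the assumption that an oracle can regroup, at zero cost, threads from any SM that wait at the same program counter. All names and the warp-size constant are illustrative, not the simulator's actual code:

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <vector>

    constexpr std::size_t WARP_SIZE = 32;

    struct ThreadCtx {
        int tid;            // global thread id; the home SM is irrelevant to the oracle
        std::uint64_t pc;   // next instruction this thread will execute
    };

    // Regroup ready threads from all SMs by program counter, then cut
    // full warps out of each group, so that each rebuilt warp executes
    // a single control flow path at full SIMD width.
    std::vector<std::vector<ThreadCtx>>
    reconstructWarps(const std::vector<ThreadCtx> &ready)
    {
        std::map<std::uint64_t, std::vector<ThreadCtx>> byPc;
        for (const ThreadCtx &t : ready)
            byPc[t.pc].push_back(t);

        std::vector<std::vector<ThreadCtx>> warps;
        for (auto &entry : byPc) {
            std::vector<ThreadCtx> &group = entry.second;
            for (std::size_t i = 0; i < group.size(); i += WARP_SIZE) {
                std::size_t end = std::min(i + WARP_SIZE, group.size());
                warps.emplace_back(group.begin() + i, group.begin() + end);
            }
        }
        return warps;  // at most one partially filled warp per distinct PC
    }

A per-SM scheduler can only regroup its own threads; the point of the oracle is to quantify how much additional SIMD efficiency a global scope could recover.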

We use ODGS to study the potential for performance improvement when global warp reconstruction is used. We show by simulations of synthetic and real benchmarks that there is still substantial potential for performance improvement via more sophisticated global scheduling approaches, especially when targeting control flow bound applications that make no use of shared memory.

For example, MUM and BFS are parallel graph algorithms that suffer from significant branch divergence. We show that in these algorithms it is possible, under ideal conditions, to achieve performance gains of up to 4.4x and 2.6x, respectively, relative to previously applied scheduling methods.

We perform a sensitivity analysis and show that implementing such hierarchical global warp scheduling in real hardware is feasible. Our findings indicate that if future GPUs are built to allow global warp reconstruction and thread migration between SMs, the optimization problem will be eased, improving the performance of applications that otherwise could not run efficiently in a GPGPU environment, or would at least require a long software optimization phase.

These findings demonstrate the need for future research in the field of global thread scheduling and migration in GPGPU hardware.

The rest of the paper is organized as follows: Section 2 presents an analytical study of the various phenomena that hamper the performance of control-flow-intensive application programs running on GPGPUs. Section 3 presents an experimental study involving different scheduling methods and CUDA benchmarks. Section 4 discusses previous work, and Section 5 presents conclusions and future work.

