
ROCm with Rapid Harmony: Optimizing HSA Dispatch

We previously looked at how to launch an OpenCL™ kernel using the HSA runtime. That example showed the basics of using the HSA Runtime. Here we'll turn up the tempo a bit by optimizing the launch code: moving some expensive operations into the setup code (rather than paying for them on each dispatch), removing host-side synchronization, and reducing the memory fences to the bare minimum required. We'll measure the contributions of the different optimizations and discuss the results. The code is available in the same GitHub repository as before, and the optimizations can be enabled with a series of command-line switches.

Optimizing

Bitonic sort involves running the same kernel several times. For the default array length of 32768, the algorithm launches 120 kernels. The original OpenCL code, and the port used in the example, synchronize with the host after each kernel completes. To improve performance, we can submit all 120 kernels at once and only synchronize with the host after the last one completes. To make this change, we restructure the BitonicSort::run call along the lines of the sketch below.
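Here is a minimal sketch of the batched-dispatch idea. The function name, parameters, and helper variables are illustrative placeholders, not the sample's actual identifiers; only the last packet carries a completion signal, and the host waits exactly once:

```cpp
#include <hsa/hsa.h>

// Submit numKernels dispatches back-to-back and wait once at the end.
void runBatch(hsa_queue_t* queue, uint64_t kernelObject,
              void* kernargs[], int numKernels,
              uint32_t gridSize, uint32_t workgroupSize,
              hsa_signal_t completionSignal) {
    hsa_signal_store_relaxed(completionSignal, 1);  // arm the batch signal

    for (int k = 0; k < numKernels; ++k) {
        // Reserve a slot in the queue's ring buffer (production code would
        // also check the read index before reusing a slot).
        uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
        hsa_kernel_dispatch_packet_t* packet =
            (hsa_kernel_dispatch_packet_t*)queue->base_address +
            (index & (queue->size - 1));

        packet->setup            = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
        packet->workgroup_size_x = workgroupSize;
        packet->workgroup_size_y = 1;
        packet->workgroup_size_z = 1;
        packet->grid_size_x      = gridSize;
        packet->grid_size_y      = 1;
        packet->grid_size_z      = 1;
        packet->kernel_object    = kernelObject;
        packet->kernarg_address  = kernargs[k];  // preallocated (see below)
        packet->private_segment_size = 0;  // set from the code object in real code
        packet->group_segment_size   = 0;

        // Only the last kernel notifies the host; the rest complete silently.
        bool last = (k == numKernels - 1);
        hsa_signal_t none = {0};
        packet->completion_signal = last ? completionSignal : none;

        // The barrier bit serializes the passes; fence scopes are tuned later
        // (+optFence); system scope shown here, as in the baseline.
        uint16_t header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
        header |= 1 << HSA_PACKET_HEADER_BARRIER;
        header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
        header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;

        // Publish the header last, then ring the doorbell.
        __atomic_store_n(&packet->header, header, __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(queue->doorbell_signal, index);
    }

    // Single host-side wait, after all kernels have been submitted.
    hsa_signal_wait_acquire(completionSignal, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
}
```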

Results

After making these changes, we see the speedups shown in the chart and table below.

[Chart: RunTime/Iteration for each optimization level; the same data appears in the table below.]

The timing numbers shown here include the time to transfer the array to the GPU, run all of the kernels, and transfer back the result. The numbers do not include time spent initializing the runtime, allocating memory, or performing the result verification. The times reflect sorting 32768 integers using 120 kernels. This is a relatively small workload to offload to the GPU (only 128KB), and as a result the kernels run in 3-4us each, which stresses exactly the HSA runtime features we want to discuss.

| Variant             | RunTime/Iteration (us) | Delta/Iteration (us) |
|---------------------|------------------------|----------------------|
| Baseline            | 1943                   |                      |
| +optPreallocSignal  | 1906                   | -37                  |
| +optPreallocKernarg | 1869                   | -37                  |
| +optAvoidHostSync   | 1665                   | -204                 |
| +optFence           | 1221                   | -444                 |
| +optPinnedHost      | 1137                   | -84                  |

(Each row enables one more optimization on top of the previous row.)

The results show that signal allocation and kernarg allocation each take approximately 37us, which makes sense since both operations require a trip into kernel space (ROCK) and perform memory allocation. Even the baseline shares one signal and one kernarg allocation across all 120 kernels, but the overhead is still significant: kernels can be dispatched in 5-10us each, so well-optimized programs will want to perform these operations outside the critical path. The optimized code path moves these operations into the setup routine. Another option is to create a buffer pool of signals and kernargs (this is the approach used by HCC), or to use thread-local storage (if thread safety is required).
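A sketch of what +optPreallocSignal and +optPreallocKernarg move into setup. The function and parameter names are illustrative; kernargRegion is assumed to be a region found once via hsa_agent_iterate_regions with the HSA_REGION_GLOBAL_FLAG_KERNARG flag:

```cpp
#include <hsa/hsa.h>
#include <stddef.h>

hsa_status_t preallocate(hsa_region_t kernargRegion,
                         size_t kernargBytesPerKernel, int numKernels,
                         hsa_signal_t* signal, void** kernargPool) {
    // +optPreallocSignal: create the completion signal once, not per dispatch.
    hsa_status_t status = hsa_signal_create(1, 0, NULL, signal);
    if (status != HSA_STATUS_SUCCESS) return status;

    // +optPreallocKernarg: one kernarg block for the whole batch, carved up
    // per kernel, so the dispatch path never re-enters the kernel driver.
    return hsa_memory_allocate(kernargRegion,
                               kernargBytesPerKernel * numKernels, kernargPool);
}
```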

Avoiding the host synchronization saves 204us, or about 1.7us per kernel.
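For contrast, the per-kernel round trip that +optAvoidHostSync eliminates looks roughly like this, executed after every one of the 120 dispatches:

```cpp
// Baseline pattern: block until this kernel's completion signal reaches 0,
// then re-arm it for the next dispatch.
hsa_signal_wait_acquire(signal, HSA_SIGNAL_CONDITION_LT, 1,
                        UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
hsa_signal_store_relaxed(signal, 1);
```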

The system-scope fences are fairly expensive: Fiji has a 2MB L2 cache, and it takes 3-4us to flush the entire thing. Meanwhile, the default working set is only 128KB (32K elements * 4 bytes/element), which fits easily in the L2 cache, so each kernel in the sequence reads from the L2 and writes the data back to it. By relaxing these fences to AGENT scope where possible, we save approximately 3.7us per kernel launch.
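In packet-header terms the change looks roughly like this (the first/last flags are illustrative; the actual sample gates this behind +optFence):

```cpp
#include <hsa/hsa.h>

// AGENT scope keeps kernel-to-kernel results visible within the GPU without
// flushing L2 out to memory; SYSTEM scope is reserved for the batch
// boundaries, where the host writes the input (before the first kernel)
// and reads the sorted output (after the last kernel).
static uint16_t dispatchHeader(bool first, bool last) {
    uint16_t header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
    header |= 1 << HSA_PACKET_HEADER_BARRIER;
    header |= (first ? HSA_FENCE_SCOPE_SYSTEM : HSA_FENCE_SCOPE_AGENT)
                  << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
    header |= (last ? HSA_FENCE_SCOPE_SYSTEM : HSA_FENCE_SCOPE_AGENT)
                  << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
    return header;
}
```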

Finally, using pinned host memory improves the transfer speeds from around 6GB/s to 14GB/s. In this workload we see a modest improvement (84us), since most of the benchmark time is spent running the kernels and synchronizing between them.
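One standard way to get pinned transfers is to register the host buffer with the runtime up front. This is a sketch; the sample's +optPinnedHost switch may use a different mechanism (for example, the ROCr hsa_amd_memory_lock extension):

```cpp
#include <hsa/hsa.h>
#include <vector>

void sortWithPinnedHost(size_t length) {
    std::vector<int> input(length);
    // Register (pin) the host buffer so the DMA engines can read and write
    // it directly, instead of staging through unpinned memory at ~6GB/s.
    hsa_memory_register(input.data(), input.size() * sizeof(int));

    // ... transfer to the GPU, run the 120 kernels, transfer back ...

    hsa_memory_deregister(input.data(), input.size() * sizeof(int));
}
```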

Overall, these optimizations make the example 1.7X faster than the baseline version (1943us down to 1137us per iteration).


References:

Wikipedia has a nice description of the Bitonic sort algorithm, including pictures. Eric Bainville wrote a nice explanation describing how to optimize Bitonic Sort for the GPU.