Prefix Sum on Vulkan

Today, there are two main ways to run compute workloads on GPU. One is CUDA, which has a fantastic ecosystem including highly tuned libraries, but is (in practice) tied to Nvidia hardware. The other is graphics APIs used primarily for gaming, which run on a wide variety of hardware, but historically offer much less power than CUDA. Also, the tooling for compute in that space is terrible.

Vulkan has been catching up fast in its raw capabilities, with recent extensions supporting more advanced GPU compute features such as subgroups, pointers, and a memory model. Is it getting to the point where it can run serious compute workloads?

In this blog post are some initial explorations into implementing prefix sum on recent Vulkan. I have a rough first draft implementation which suggests that Vulkan might be a viable platform, for a sufficiently persistent implementor.

Why prefix sum?

As Hacker News user fluffything points out in this HN thread on my Taste of GPU compute talk, prefix sum is an excellent benchmark for evaluating GPU compute languages and runtimes.

For one, it is useful in and of itself. I use it in font-rs to integrate fragments of exact-area computations to arrive at the total area coverage for font rendering. It is also used as a primitive in many more operations, including GPU-side dynamic allocation and compaction.

For two, it is simple. The sequential version can be expressed in just a handful of lines of code:

def prefix_sum(a):
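    # inclusive scan: result[i] = a[0] + ... + a[i]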
    s = 0
    result = []
    for x in a:
        s += x
        result.append(s)
    return result

For three, it is challenging but possible to implement efficiently on GPU. The above code has a strictly sequential dependency, but because addition is associative, it is possible to exploit a great deal of parallelism, and there is literature on that going back decades. Even so, efficiently exploiting that parallelism on GPU requires communication between invocations (“threads” in more common GPU lingo) and careful attention to the memory hierarchy.

The generalization of prefix sum is called “scan,” and works with any associative operation, not just addition. It doesn’t even have to be commutative; examples of that include regular expressions and IIR filtering. More precisely, a scan can be done with any monoid, a structure with an identity element as well as the associative operation; the identity element is required for the “exclusive” variant of scan, as it is the first element of the output.
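To make the monoid framing concrete, here is a minimal sequential sketch of an exclusive scan parameterized by the associative operation and its identity element (the function and parameter names are my own, not from the piet-gpu code):

def exclusive_scan(a, op, identity):
    # output[0] is the identity; output[i] = op(a[0], ..., a[i-1])
    out, acc = [], identity
    for x in a:
        out.append(acc)
        acc = op(acc, x)
    return out

print(exclusive_scan([1, 2, 3, 4], lambda x, y: x + y, 0))      # [0, 1, 3, 6]
# a non-commutative monoid: string concatenation, with "" as the identity
print(exclusive_scan(["a", "b", "c"], lambda x, y: x + y, ""))  # ['', 'a', 'ab']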

Implementation on GPU

The state of the art is decoupled look-back. I’m not going to try to summarize the algorithm here, but recommend reading the paper. The results are impressive — for large data sets, they report reaching memcpy speeds, meaning that no further speedup is possible.

That work is a refinement of Parallel Prefix Sum (Scan) with CUDA from Nvidia’s GPU Gems 3 book. A production-quality, open source implementation is CUB. Another implementation, designed to be more accessible but not as optimized, is ModernGPU scan.

My own implementation is very much a research-quality proof of concept. It exists as the prefix branch of the piet-gpu repository. Basically, I wanted to determine whether it was possible to come within a stone’s throw of memcpy performance using Vulkan compute kernels. It’s a fairly straightforward implementation of the decoupled look-back paper, and doesn’t implement all the tricks. For example, the look-back is entirely sequential; I didn’t parallelize the look-back as suggested in section 4.4 of the paper. This is probably the easiest performance win to be gotten. But it’s not too horrible, as the partition size is quite big; each workgroup processes 16ki elements. Rough measurements indicate that look-back is on the order of 10-15% of the total time.
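To give a flavor of the algorithm, here is a much-simplified, CPU-threaded Python sketch of the decoupled look-back idea (not my GPU code): each partition publishes its local aggregate, then walks backwards over its predecessors’ published state to reconstruct its exclusive prefix. In the real kernel the local work is a parallel workgroup scan, and the published state is accessed with atomics under the Vulkan memory model; everything here is illustrative only.

import threading

NOT_READY, AGGREGATE, PREFIX = 0, 1, 2

def decoupled_lookback_scan(data, partition_size):
    n_parts = (len(data) + partition_size - 1) // partition_size
    # per-partition published state: (flag, aggregate, inclusive prefix)
    state = [(NOT_READY, 0, 0)] * n_parts
    out = [0] * len(data)

    def process(p):
        lo = p * partition_size
        hi = min(lo + partition_size, len(data))
        agg = sum(data[lo:hi])            # local reduction (a workgroup scan on GPU)
        state[p] = (AGGREGATE, agg, 0)    # publish the aggregate so successors can proceed
        # look back over predecessors, accumulating their aggregates,
        # until one of them has published a full inclusive prefix
        exclusive, look = 0, p - 1
        while look >= 0:
            flag, a, pref = state[look]
            if flag == PREFIX:
                exclusive += pref
                break
            if flag == AGGREGATE:
                exclusive += a
                look -= 1
            # NOT_READY: spin until the predecessor publishes something
        state[p] = (PREFIX, agg, exclusive + agg)   # publish our inclusive prefix
        s = exclusive                               # finally, write this partition's outputs
        for i in range(lo, hi):
            s += data[i]
            out[i] = s

    threads = [threading.Thread(target=process, args=(p,)) for p in range(n_parts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(decoupled_lookback_scan(list(range(1, 11)), 3))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]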

The implementation is enough of a rough prototype that I don’t yet want to do careful performance evaluation, but initial results are encouraging: it takes 2.05ms of GPU time to compute the prefix sum of 64Mi 32-bit unsigned integers on a GTX 1080, a rate of 31.2 billion elements/second. Since each element involves reading and writing 4 bytes, that corresponds to a raw memory bandwidth of around 262GiB/s. The theoretical memory bandwidth is listed as 320GB/s, so clearly the code is able to consume a large fraction of available memory bandwidth.

Do we need a memory model?

One of the achievements of “modern C++” is the C++11 memory model. Before then, the mechanism for lock-free programming patterns was the volatile qualifier and various nonstandard barrier intrinsics. People reasoned about these operationally — the primary function of volatile was to disable certain optimizations, and the barrier intrinsics compile to memory fence instructions, which generally cause hardware to flush caches.

Today, most lock-free aficionados consider those times to be barbaric. The semantics of volatile were never clearly defined, and the barrier instructions had the disturbing property of being hardware specific. Because x86 has “total store order,” barrier instructions are generally not needed for publication safety. However, the same code on, say, ARM, which has more weakly ordered memory semantics, would fail, often in subtle ways.

With the C++11 memory model, the programmer specifies the needed ordering constraints precisely. The compiler can then optimize the program very aggressively, as long as it meets those constraints. For example, acquire and release semantics (the basis of publication safety) will compile to explicit memory fence instructions on ARM, but to nothing on x86. A good writeup is the blog post C++ atomics and memory ordering.

The new Vulkan memory model brings the same idea to GPU compute. I used it in my code, in large part because I wanted to experiment with it. I’ve done a fair amount of lock-free code using the C++ memory model. And lock-free code, while fairly rare on the CPU (my main motivation is to avoid priority inversion for real time audio), is more or less required on the GPU, because mutex is not available in kernel code. Even if it were, it would create a lot of problems, as it would block the entire subgroup, not just a single thread (one of the features of the Vulkan memory model is a much weaker forward progress guarantee than threads running on CPU).

Is a memory model absolutely required to run this code? If you replace the atomic loads and stores with simple array accesses, it deadlocks. However, at least on my hardware, correct operation can be recovered by adding the volatile qualifier to the WorkBuf array. As with older style C++, there are two risks. Though it seems to work reliably and efficiently on my hardware, it’s possible the volatile qualifier and explicit fences cause more cache flushing than is needed, or suppress other optimizations that might be possible with a more precise expression of the memory semantics. Alternatively, other hardware or drivers might optimize even more aggressively and break the code.

We’re already seeing variation in hardware that requires different levels of vigilance for memory semantics. On most GPU hardware, the invocations (threads) within a subgroup (warp) execute in lock-step, and thus don’t require any synchronization. However, as of Nvidia Volta, the hardware is capable of independent thread scheduling. Correct code will add explicit memory semantics even within a subgroup, which will, as in total store order on x86, compile to nothing on hardware that runs invocations in lockstep. (It’s not obvious to me yet that the capabilities of Vulkan, even with the subgroup and memory model extensions, have the same power to generate code optimized for independent thread scheduling as, say, the __shfl_sync intrinsic, as the Vulkan subgroup operations don’t take a mask argument. Maybe someone who knows can illuminate me.)

In my research for this blog post, I did not come across any evidence of people actually using the Vulkan memory model, i.e. no search hits for the relevant identifiers other than work associated with the spec. Thus, one contribution of this blog post is to show a concrete example of code that uses it.

Dynamic allocation on GPU

On GPU, it’s easiest to run workloads that use static allocation, for example a fixed size buffer per workgroup, and workgroups arranged in a 2D grid (“dispatch” operations support 1D and 3D as well). But dynamic allocation is possible, with care.

The two major approaches to dynamic allocation are prefix sum and atomic bump allocation. The main reason to choose one over the other is whether you care about the ordering. Let’s take a simple problem of computing some function on an array of input values, where the output is variable sized.

Using a prefix-sum approach, you run a first pass of computing the size of the output. The prefix sum of that result yields an offset into an output buffer. The second pass (after the prefix sum) computes the function and writes it into the output buffer, using the offset provided by the prefix sum. [Also note that if we’re getting really fancy, it might be possible to fuse either or both of these passes with the prefix sum itself, decreasing the amount of global memory traffic but increasing register pressure and otherwise constraining efficient use of the memory hierarchy, so the extent to which this helps depends greatly on the exact problem].
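As an illustration of the two-pass scheme, here is a sequential Python sketch (not GPU code; size_of and emit are hypothetical callbacks standing in for the problem-specific passes):

def allocate_with_prefix_sum(items, size_of, emit):
    sizes = [size_of(x) for x in items]     # pass 1: compute output sizes only
    offsets, total = [], 0
    for s in sizes:                         # exclusive prefix sum -> output offsets
        offsets.append(total)
        total += s
    out = [None] * total
    for x, off in zip(items, offsets):      # pass 2: write each output at its offset
        for j, v in enumerate(emit(x)):
            out[off + j] = v
    return out

# e.g. expand each integer n into n copies of itself, preserving input order
print(allocate_with_prefix_sum([3, 1, 2],
                               size_of=lambda n: n,
                               emit=lambda n: [n] * n))   # [3, 3, 3, 1, 2, 2]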

An atomic bump allocation approach simply does atomicAdd on each output, using a bump allocation index (effectively a pointer) as the first argument and the size of the allocation as the second. This yields results broadly similar to the prefix sum approach, but with the outputs in arbitrary order. Perhaps the order is not important, or, alternatively, a sort pass can be applied afterwards (sorting on GPU is another topic with a rich literature).
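A minimal sketch of the bump-allocation side, with a lock standing in for the GPU’s atomicAdd on a global allocation index (the class and method names are my own):

import threading

class BumpAllocator:
    def __init__(self):
        self.head = 0
        self._lock = threading.Lock()

    def alloc(self, size):
        # atomicAdd(head, size): returns the old value, which is this allocation's offset
        with self._lock:
            offset = self.head
            self.head += size
            return offset

bump = BumpAllocator()
print([bump.alloc(n) for n in [3, 1, 2]])   # [0, 3, 4]; with concurrent callers the order is arbitrary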

The two can be combined. For example, it makes sense to do a prefix sum of the sizes of items within a workgroup, and a single atomic bump allocation for the per-workgroup total.
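A sequential sketch of that combination (again illustrative only): each “workgroup” does a local exclusive prefix sum of its item sizes, then makes a single bump allocation for its total.

def combined_alloc(groups):
    head = 0                              # global bump-allocation index
    offsets = []
    for sizes in groups:                  # one iteration stands in for one workgroup
        local, total = [], 0
        for s in sizes:                   # local exclusive prefix sum
            local.append(total)
            total += s
        base, head = head, head + total   # one "atomicAdd" per workgroup
        offsets.append([base + o for o in local])
    return offsets

print(combined_alloc([[2, 1], [3], [1, 1, 1]]))   # [[0, 2], [3], [6, 7, 8]]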

One problem that might benefit from prefix sum for dynamic allocation is flattening Bézier curves to polylines. Each Bézier segment can be computed in parallel, but you generally want to preserve the order of segments within the full path. The flattening algorithm I presented in that blog post (and its generalization to cubics) fits nicely into this framework — it’s already in two passes, where the first computes the number of segments required, and the second can compute the coordinates of each point in the output independently, thus in parallel.

Subgroups and subgroup size

High performance prefix sum requires coordination between threads — it’s possible to extract some parallelism by running O(log n) tree reduction passes, each of which pulls only from the previous pass, but this would be considerably slower than state of the art. Coordination must be at all levels of the hierarchy. GPU compute has always made threadgroup shared memory available for such coordination. An even faster but newer capability is subgroups, not yet universally supported.

My prototype code uses subgroups extensively. One serious limitation is that it assumes a subgroup size of 32, which is true for some hardware. However, other hardware has different size subgroups, and then Intel is special.

By default, when compiling a compute kernel, the Intel drivers use a heuristic to determine the subgroup size, which can then be 8, 16, or 32. It actually makes sense they use a heuristic, as there’s a complex tradeoff. A bigger subgroup means bigger chunks of work, which means less per-chunk overhead, but also fewer registers available per thread, and potentially more wasted work due to divergence. Again, that depends on workloads. For low-probability, expensive conditional work, generally not a good fit for GPU but sometimes unavoidable, wasted work tends to scale with subgroup size.

It might be possible to write a kernel that adapts to subgroup size, but there are a number of considerations that make this tricky. One is whether the number of items processed by a workgroup adapts to subgroup size. If so, then the size of the dispatch must be adapted as well. There is an extension for the CPU side to query subgroup size of a pipeline, but, sadly, it doesn’t seem to be implemented on Intel drivers on Windows, where it would be most useful. (It is, thankfully, in the latest Linux Intel drivers, so hopefully will be coming soon.)

Another problem is querying the subgroup size from inside the kernel, which has a surprising gotcha. Unless the VK_PIPELINE_SHADER_STAGE_CREATE_ALLOW_VARYING_SUBGROUP_SIZE_BIT_EXT flag is set at pipeline creation time, the gl_SubgroupSize variable is defined to have the value from VkPhysicalDeviceSubgroupProperties, which in my experiment is always 32 on Intel no matter the actual subgroup size. But setting that flag makes it give the expected value.

Newer (Vulkan 1.2) Intel drivers offer finer control over the subgroup size, with the VK_EXT_subgroup_size_control extension. With that, I can set the subgroup size to 32, and the kernel works fine. Note though that in general, setting a too-large subgroup size can actually make performance worse, as it increases the chance of register spilling.

In practice, the programmer will write multiple versions of the kernel, each tuned for a different subgroup size, then on the CPU side the code will query the hardware for supported subgroup sizes and choose the best one that can run on the hardware. Note that, in general, querying the range of supported subgroup sizes requires the subgroup size extension to be reliable, though you could do string-matching on the device name to come up with a good guess. In any case, the cost and difficulty of this kind of performance tuning is one reason Nvidia has such a strong first-mover advantage.
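As a sketch of that CPU-side selection logic (the variant names, and the “largest size that fits” heuristic, are my own hypothetical choices; the right variant may well depend on the kernel and the hardware):

# map from tuned subgroup size to a (hypothetical) precompiled SPIR-V variant
PREFIX_SUM_VARIANTS = {8: "prefix_sum_sg8.spv", 16: "prefix_sum_sg16.spv", 32: "prefix_sum_sg32.spv"}

def pick_variant(min_subgroup_size, max_subgroup_size):
    # choose the largest tuned variant within the device's supported range
    for size in sorted(PREFIX_SUM_VARIANTS, reverse=True):
        if min_subgroup_size <= size <= max_subgroup_size:
            return PREFIX_SUM_VARIANTS[size]
    raise ValueError("no tuned variant fits the supported subgroup sizes")

print(pick_variant(8, 32))   # prefix_sum_sg32.spv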

Brian Merchant has done more exploration into the tradeoff between subgroups and threadgroup shared memory, for a different primitive operation, transpose of 32x32 boolean matrices. That transpose timing writeup contains measurements on a variety of hardware, and is recommended to the interested reader.

What does subgroupInclusiveAdd compile to?

The subgroupInclusiveAdd function seems like it’s doing a lot — it’s performing a prefix sum operation on an entire subgroup’s worth of data. Does hardware contain an assembly instruction that directly implements it? What if you want to do an operation other than addition, where there isn’t an intrinsic available?

Obviously different hardware will be different, but looking at the Radeon GPU Analyzer output on Shader Playground tells us a lot. It generates a tree reduction (the Hillis-Steele algorithm as presented in the prefix sum Wikipedia page) with lg(n) stages of subgroup shuffle + add. Since subgroup shuffle is available in Vulkan (but see below), if you were to write out such a reduction you’d be able to get similar results.
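To make that concrete, here is a small Python simulation of the pattern the compiler generates: lg(n) rounds of “shuffle up by a power of two, then add,” with each list element standing in for one invocation’s register (a sketch of the pattern, not the actual generated code):

def subgroup_inclusive_add(lanes):
    # Hillis-Steele scan: lg(n) rounds of shuffle-up + add.
    n = len(lanes)
    result = list(lanes)
    offset = 1
    while offset < n:
        # shuffle up by `offset`: lane i reads lane i - offset (out-of-range lanes contribute 0)
        shuffled = [result[i - offset] if i >= offset else 0 for i in range(n)]
        result = [a + b for a, b in zip(result, shuffled)]
        offset *= 2
    return result

print(subgroup_inclusive_add([1] * 8))   # [1, 2, 3, 4, 5, 6, 7, 8]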

On AMD hardware there is one additional twist: AMD has an additional level of hierarchy between subgroup (64 invocations, 1 wavefront) and invocation (thread). Internally, the hardware is organized around a row of 16 elements. Access to elements within a row uses a different instruction modifier (row_shr) than across the entire wave (row_bcast or wave_ror, for two examples), and is likely lower latency in the chip. The Vulkan subgroup extensions provide a powerful and portable set of operations, but don’t expose all of the lowest-level operations available on the GPU hardware. To squeeze the last few percent of performance, assembly is still useful.

Portability considerations: DX12

It is tempting to use a portability layer such as gfx-hal to run compute workloads on a variety of graphics APIs. (Other such portability layers include MoltenVK for running Vulkan on top of Metal, and similar work for running OpenCL on DX12.) But such an approach is limited to the lowest common denominator — it can’t provide capabilities that are missing in the underlying layer.

Here are some of the pain points for DX12:

  • No subgroup size control.

  • No subgroup shuffle operations — use threadgroup shared memory instead.

  • No memory model — use volatile and explicit barriers instead.

  • No pointers (not particularly useful for this workload, but important for others).

Also note that gfx-hal currently doesn’t give access to Shader Model 6 intrinsics (subgroup operations), but there’s an issue and hopefully that will be fixed.

Portability considerations: Metal

Metal is closer to Vulkan in capabilities (especially newer versions), but still lacks subgroup size control and a memory model.

A challenge for GPU compute infrastructure

I covered a fair number of GPU compute infrastructure projects in my talk and the associated GPU resources list. Since then I’ve learned of quite a few more:

  • vuda, which promises to run CUDA workloads on Vulkan.

  • clvk and clspv, which run OpenCL workloads on Vulkan.

  • OpenCL 3.0 is announced, with a number of strategies to rescue OpenCL from a fate of irrelevance.

  • oneAPI, which offers a CUDA migration path but aspires to being a portable standard.

I am also optimistic about WebGPU becoming a viable platform for compute workloads, both delivered over the web and in native implementations such as wgpu.

Echoing fluffything’s comment, I propose adopting prefix sum as something of a “hello world” benchmark of GPU compute. It’s simple enough that it should be practical to implement without too much effort (and if not, that’s also an important data point), it exercises “advanced” features such as subgroup shuffles, and it’s reasonably easy to quantify. When looking at these potential infrastructure projects, ask these questions:

  • How close can it get to the performance offered by the hardware?

  • How portable is the high-performance result?

  • Are there ways to smoothly downgrade on less capable platforms?

The results of my explorations on Vulkan suggest (but do not yet prove) good answers to these questions, though at the expense of doing a lot of the low-level legwork yourself, and programming the kernel in a very low-level style (in GLSL). I think there’s a huge opportunity for more sophisticated tools.

Also, I think it’s a great benchmark for the emerging field of GPU-friendly languages. Is it possible to express the algorithm in a reasonably high-level manner? If so, does it compile to code with competitive performance? Can we write a high-performance abstraction as a library that can be consumed easily? Can that abstraction offer portability across hardware but hide the complexity from its users? Can you provide your own monoid?

Conclusion

I’ve shown that Vulkan can do prefix sum with near state of the art performance. However, I’ve also outlined some of the challenges involved in writing Vulkan compute kernels that run portably and with high performance. The lower levels of the stack are becoming solid, enabling a determined programmer to ship high performance compute across a wide range of hardware, but there is also an opportunity for much better tooling at the higher levels. I see a bright future ahead for this approach, as the performance of GPU compute is potentially massive compared with CPU-bound approaches.

Thanks to Brian Merchant, Matt Keeter, and msiglreith for discussions on these topics, and Jason Ekstrand for setting me straight on subgroup size concerns on Intel.

