Checking Frame Damage on the GPU



Yesterday I finished some changes to my VNC server that offload some of the damage checking to the GPU. This doesn’t really improve performance much, but it does put about 10% less load on a single CPU core on my system. Generally, my CPU isn’t very loaded, except when I’m compiling code. I don’t play computer games, so I imagine that my GPU spends most of its time sleeping. The same is probably true of people who run VNC: they don’t play video games that much, at least not while they’re running VNC. Video gaming over VNC is probably never going to take off. So why not put the GPU to some good use?

What is damage checking, you may ask? Damage checking is comparing a new frame (image) to the last frame to see which pixels have changed between them, i.e. which parts of the framebuffer are “damaged” and must be redrawn. This is done to reduce network traffic, but it is also useful in that it reduces the amount of memory that needs to be accessed by the encoding algorithm. Here’s some pseudocode to express this in its simplest form:

for (y = 0; y < height; y++)
	for (x = 0; x < width; x++)
		if (old[x, y] != new[x, y])
			mark_damaged(x, y)

As you can probably imagine, doing this on the CPU is not going to be cheap. For the average modern monitor, the memory that this touches will be at least 1920 ⋅ 1080 ⋅ 4 ⋅ 2 ≈ 16 MB! This number does not include the buffer to store the damage in. If it is stored in a bitmap, it will take 1920 ⋅ 1080 / 8 ≈ 250 kB, which is not so bad, but it probably precludes any SIMD magic taking place in this algorithm. Doing something like damage[x, y] = old[x, y] != new[x, y] might be auto-vectorized into something decent if damage is implemented as a byte array, but not as a bit map, since a bit map would require state to be carried over from previous iterations.
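To illustrate the auto-vectorization point, here is a minimal sketch of the byte-map variant (the function name is my own); a loop this simple is something most compilers can vectorize on their own:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: compare two 32-bit RGBA frames pixel by pixel and write the
 * result into a byte map (one byte per pixel). Because each iteration is
 * independent, the compiler is free to auto-vectorize this loop. */
void check_damage(uint8_t *damage, const uint32_t *old_frame,
                  const uint32_t *new_frame, size_t width, size_t height)
{
	for (size_t i = 0; i < width * height; ++i)
		damage[i] = old_frame[i] != new_frame[i];
}
```

A bit-map variant of the same loop would have to pack eight results into each output byte, which is exactly the cross-iteration dependency that gets in the vectorizer’s way.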

When the damage has been found, we must find some contiguous regions within it, so that those regions can be encoded and sent to the client. For this I use pixman, which is a pixel manipulation library. The function that I use to mark damaged regions (pixman_region_union_rect) does a non-trivial amount of work, including memory allocations. Implementing mark_damaged() in the example above as pixman_region_union_rect(&damage, &damage, x, y, 1, 1) is not a good idea.

Knowing the damage on a per-pixel basis isn’t actually very useful, nor is it practical, as we’ve seen. It is even detrimental to encoding efficiency to have too fine-grained regions, since encoding each distinct region carries some overhead. One approach, which is what I use, is to split the image into tiles. Each tile is a 32×32 pixel region. There is nothing scientific about this number really; it’s just something that works well enough. A simple algorithm may now look like this:

for (y = 0; y < height; y += 32)
	for (x = 0; x < width; x += 32)
		if (old[x:x+32, y:y+32] != new[x:x+32, y:y+32])
			mark_damaged(x, y, 32, 32)

Something like this is what wayvnc has been doing via NeatVNC for a while now. See damage.c.
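For illustration, here is a C sketch of the tiled check (the names and the stride parameter are my own, not NeatVNC’s); comparing each tile row with memcmp() lets libc use its optimized comparison routines:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch: check whether the 32x32 tile whose top-left corner is at
 * (tx, ty) differs between two RGBA frames. stride is the frame width
 * in pixels; tiles at the right and bottom edges are clipped. */
static bool is_tile_damaged(const uint32_t *old_frame,
                            const uint32_t *new_frame, int stride,
                            int width, int height, int tx, int ty)
{
	int w = width - tx < 32 ? width - tx : 32;
	int h = height - ty < 32 ? height - ty : 32;

	for (int y = ty; y < ty + h; ++y) {
		size_t off = (size_t)y * stride + tx;
		if (memcmp(old_frame + off, new_frame + off,
		           w * sizeof(uint32_t)) != 0)
			return true;
	}
	return false;
}
```

The outer loop from the pseudocode would call this once per tile and invoke mark_damaged() whenever it returns true.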

It is probable that the naive algorithm above would actually perform better than the tiled one, given a more suitable mark_damaged() function combined with one that turns the resulting bit map or byte map into a pixman_region. I have not explored this, as there are other ventures far likelier to yield better results. Both approaches are pretty poor when it comes to conserving memory bandwidth: we still have to check every pixel. That doesn’t change.
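One way to turn such a byte map into a small set of rectangles is to merge horizontal runs of damaged cells, so that pixman_region_union_rect() is called once per run instead of once per pixel. A minimal sketch (the function name and the scale parameter are my own):

```c
#include <stddef.h>
#include <stdint.h>

struct rect { int x, y, width, height; };

/* Sketch: merge horizontal runs of damaged cells in a byte map into
 * rectangles. scale is the size in pixels covered by one map cell
 * (1 for a per-pixel map, 32 for a per-tile map). Each resulting rect
 * could then be fed to pixman_region_union_rect() in a single call. */
size_t coalesce_rows(const uint8_t *map, int w, int h, int scale,
                     struct rect *out, size_t max_out)
{
	size_t n = 0;

	for (int y = 0; y < h; ++y) {
		for (int x = 0; x < w; ) {
			if (!map[y * w + x]) {
				++x;
				continue;
			}
			int start = x;
			while (x < w && map[y * w + x])
				++x;
			if (n < max_out)
				out[n++] = (struct rect){
					start * scale, y * scale,
					(x - start) * scale, scale,
				};
		}
	}
	return n;
}
```

pixman’s region code also merges adjacent rectangles vertically, so the runs produced here are still a coarser input than calling it per pixel.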

wayvnc can do frame capturing via Linux DMA-BUFs. In short, these are resources, represented by file descriptors, that can be passed between processes. A GPU memory region can be represented by such an entity, so frames can be sent from the Wayland compositor to the VNC server. Copying things from the GPU is pretty expensive too, as it requires the CPU to reach into the GPU’s memory and grab 8 MB of data. This happens in the compositor when you use the wlr-screencopy protocol: it writes the resulting frame into shared memory, but slows down the compositor while it does so. It is better to have the client do the copying, because that leaves the compositor free to do other things.

Because the frames are already on the GPU when they arrive, it definitely pays off to do some pre-processing there before passing the data on to the CPU. In fact, the simple naive damage checking algorithm can be implemented in the GLSL shader language like this:

precision mediump float;

/* The previous and the current frame: */
uniform sampler2D u_tex0;
uniform sampler2D u_tex1;

varying vec2 v_tex_coord;

void main()
{
	/* 1.0 if the pixel differs between the two frames, 0.0 otherwise: */
	float r = float(texture2D(u_tex0, v_tex_coord).rgb != texture2D(u_tex1, v_tex_coord).rgb);
	gl_FragColor = vec4(r);
}

And if it is rendered into a single-channel framebuffer object (i.e. only the red component), the memory that needs to be copied to get it to the CPU will be one quarter of what copying the whole buffer would require.

Now, the actual image also needs to be copied whole. Or does it? In OpenGL ES 2.0, pixel data can be copied using the glReadPixels() function. It is limited in that only the height and the position on the vertical axis may be varied when selecting a region within the frame to copy; the width must always be the same as the width of the source buffer. With this in mind, it is actually possible to make some crude adjustments. Because the damage has already been rendered and copied, that information can be used to derive which parts of the y-axis are damaged, so one trick that saves a lot of copying is to copy just a ribbon across the screen that contains all the damage. This is how it is currently done in wayvnc. It helps when there are small changes, but for whole-screen changes or video, it makes no difference.
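The ribbon trick can be sketched like this: scan the downloaded damage buffer for the first and last damaged rows, which give the bounds of a single full-width glReadPixels() call (the function name is my own):

```c
#include <stdint.h>

/* Sketch: given a per-pixel damage byte map, find the first and last
 * damaged rows. A single full-width read of rows y_min..y_max (e.g.
 * glReadPixels(0, y_min, width, y_max - y_min + 1, ...)) then covers
 * all the damage. Returns 0 if nothing is damaged. */
int damage_ribbon(const uint8_t *damage, int width, int height,
                  int *y_min, int *y_max)
{
	int lo = -1, hi = -1;

	for (int y = 0; y < height; ++y)
		for (int x = 0; x < width; ++x)
			if (damage[y * width + x]) {
				if (lo < 0)
					lo = y;
				hi = y;
				break; /* one damaged pixel marks the row */
			}

	if (lo < 0)
		return 0;
	*y_min = lo;
	*y_max = hi;
	return 1;
}
```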

Even though the damage frame is now only 2 MB, iterating over it still shows up close to the top when running perf record and perf report. It’s nowhere near as inefficient as before, but it’s still up there. What more can be done? Well, why don’t we let the GPU handle the tiling for us? A simplified version of the shader might look like this:

precision mediump float;

uniform sampler2D u_tex0;
uniform sampler2D u_tex1;

/* Size of the input textures in pixels: */
uniform vec2 u_tex_size;

varying vec2 v_tex_coord;

bool is_pixel_damaged(vec2 pos)
{
	return texture2D(u_tex0, pos).rgb != texture2D(u_tex1, pos).rgb;
}

bool is_region_damaged(vec2 pos)
{
	bool r = false;

	/* pos is in normalised texture coordinates; convert to pixels,
	 * then scan the 32×32 tile around it: */
	vec2 centre = pos * u_tex_size;

	for (int y = -16; y < 16; ++y)
		for (int x = -16; x < 16; ++x) {
			vec2 px = centre + vec2(float(x), float(y));

			if (is_pixel_damaged(px / u_tex_size))
				r = true;
		}

	return r;
}

void main()
{
	float r = float(is_region_damaged(v_tex_coord));
	gl_FragColor = vec4(r);
}

This can be sampled into a much smaller framebuffer object: one of ⌈1920 / 32⌉ ⋅ ⌈1080 / 32⌉ = 60 ⋅ 34 = 2040 bytes. That’s a size not even worth worrying about. And as expected, neither copying it to the CPU nor iterating over it shows up in perf. This last change hasn’t made it into wayvnc yet, but it will be there as soon as I clean up the changes.

I’ll leave benchmarking as an exercise for the reader. Have fun!

There will be a second blog post on this subject soon, where I’ll go into more detail regarding the shaders, and maybe I’ll do some benchmarking. Stay tuned!

