Single-use JIT Performance on x86 Processors


Suppose you have the following code:

mem_page = allocate_memory_page(permissions = read+write+execute)
begin loop
	write_1kb_function_code_to(mem_page)
	mem_page() ; execute written code
end loop

This is essentially a JIT where the code is only used once before being overwritten and used again. (Note: this code will not work on an OS which enforces W^X security. Currently, the most widely used OSes on x86 do not enforce W^X, so the above technique is viable on most machines. For this investigation, I will assume W^X is not being enforced.)

x86 processors fully support self-modifying code (SMC), without needing the program to signal anything to the processor (e.g. no flush instruction necessary), which means that x86 processors need to automatically detect writes to instruction data and handle them immediately.

The above code, although not really a true SMC scenario, is often detected as such, due to the JIT code residing in instruction cache when it needs to be written to. Such writes usually involve invalidating caches and flushing pipelines, which causes severe performance penalties.

I came across this issue whilst investigating an algorithm like the above for performing GF(2^16) multiplication, and noticed that there are techniques which can mitigate this performance penalty.

However, I’ve been unable to find much information on this, and even less in terms of actual test results. Hence, this repository hosts some test code, and results from experimentation, which intends to investigate ways in which to reduce the overhead of such code.

Test Summary

I’ve tried to think up a number of scenarios which might have an impact on the above behaviour, and implemented them in this test application.

It’s best to look at the code to see the tests that are run, but a summary of ideas tried is listed here:

  • obvious approach - just write the code and execute it, as shown at the top of this page (labelled jit_plain in results). Also tests writing the code in reverse to see if that has any effect (labelled jit_reverse ).
  • for reference: JIT to one location and execute from another. This doesn’t achieve the aim above, since it isn’t actually executing the JIT’d code, but is used to demonstrate the ideal case of having no penalty (labelled jit_only ), as well as to compute the SMC penalty on processors
  • write code to non-executable memory (temporary location), then copy it to the allocated executable memory, using a range of techniques including memcpy (labelled jit_memcpy* ). If the memory copy means that fewer write operations hit the executable memory, it may mean fewer syncs/flushes are needed
  • ‘clear’ executable memory before writing code, i.e. overwrite the region with null bytes, using memset , or take a shortcut by writing once to each 64-byte cacheline (labelled jit_clr* ). Maybe this helps invalidate the cache quickly before the JIT process would’ve done it slowly?
  • flushing cachelines before/after writing code (labelled jit_clflush* ). Maybe this helps push the data out of the separate L1 instruction/data cache ahead of time?
  • prefetching code to L2 cache, with/without write hinting (labelled jit_prefetch* ). L1 caches separate data and instructions, but L2 is shared, so any changes to instruction data must go through L2 before arriving at the instruction cache. Prefetching is usually used to promote data to a lower level of cache, not demote to a higher level, but maybe some processors can take note of the write hint and act accordingly
  • leaving a UD2 instruction at the beginning of the executable code, until the JIT process is complete (labelled jit_ud2 ). Maybe this helps with preventing the processor from prefetching the code as it’s being written to
  • alternate between a number of pre-allocated pages (labelled jit_*region ). This might help by reducing the likelihood that the JIT writing touches memory that is cached for execution, however, this is fairly expensive in terms of cache usage
  • trying to clear the instruction cache by jumping across a large number of cachelines (labelled jit_jmp* ). Probably thrashes the instruction cache however
  • seeing whether applying a memory fence ( MFENCE instruction) or serializing operation ( CPUID instruction) between writing and executing the code, makes any difference (labelled jit_mfence and jit_serialize )
  • writing to and executing from different virtual addresses which map to the same physical page (labelled jit_dual_mapping ). The Intel manual suggests that such behaviour requires a serializing instruction before executing the written code, which presumably means that code can be written without invoking self-modifying code behaviour

Note that the JIT routine itself writes out instructions one-by-one, and doesn’t attempt to batch multiple instructions into a single write.

Testing performance is done using the RDTSC instruction. To reduce variability, it is measured across multiple iterations, and multiple trials are performed to find the fastest measurement.

Observations

The results, listed below, vary significantly across processor micro-architectures. Tests were performed on a variety of systems, including some on virtual machines. All code was compiled with GCC for x86-64.

Intel Core

  • Alternating between a number of regions, enough to exceed the L1 cache size, appears to be the most effective technique ( jit_16region ), though this could affect the performance of other code.
  • Clearing the target before writing code is the second most effective (slightly worse than the above), particularly if you just clear one byte per cacheline ( jit_clr_1byte )
  • CLFLUSHOPT is effective on Skylake/X ( jit_clflushopt ). Pre-Skylake, copying is also effective ( jit_memcpy* )
  • Clearing the instruction cache with jumps touching all the cachelines, seems to also be effective on Haswell and later ( jit_jmp32k_unalign ). This may affect performance of other code as it effectively flushes out the instruction cache

Raw results: Nehalem , [Sandy Bridge](results/Sandy Bridge.txt), [Ivy Bridge](results/Ivy Bridge (win64).txt), Haswell , [Broadwell](results/Broadwell (xen).txt), Skylake , [Skylake-X](results/Skylake-X (win64).txt)

Intel Atom

  • Silvermont ( raw result )
    • Copying appears to be the most effective, particularly if non-temporal writes are used in the copy ( jit_memcpy_sse2_nt )
    • Clearing the region seems effective too. Only one byte per cacheline is sufficient ( jit_clr_1byte )
    • Issuing CLFLUSH ( jit_clflush ) or PREFETCHW ( jit_prefetchw ) over the region also seems to work (I thought Silvermont didn’t support the PREFETCHW instruction added in Broadwell, but it turns out it supports the PREFETCHW instruction from 3DNow!)
  • Goldmont ([raw result](results/Goldmont (win64).txt))

Intel Core2

  • Copying with non-temporal writes ( jit_memcpy_sse2_nt ) seems to have a slight advantage over the straightforward method

( raw result )

AMD Family 17h

  • Zen1 ([raw result](results/Zen1 (VM).txt))
    • Copying seems to be the only effective method ( jit_memcpy* ). Unlike other CPUs here, non-temporal writes perform worse
    • Unlike on Intel Core CPUs, where overwriting the instruction code seems to invalidate the data from the instruction cache, it seems like Zen1 maintains some sync between the two. This means that repeatedly overwriting the same code, that exists in instruction cache, can have a slowdown effect, even if the code is never executed later

AMD Family 16h

  • Jaguar ([raw result](results/Jaguar (win64).txt))
    • None of the techniques tested seem to improve upon the plain case

AMD Family 15h

AMD K10

( raw result )

All Processors

  • Although, according to Intel documentation, x86 processors don’t need to sync code/instruction data residing at different virtual addresses, I found that processors did so anyway if the addresses point to the same physical page. Even after removing the synchronization instruction, the code still worked
  • Not shown in the results here, but I did test writing 4KB of code instead of 1KB, and found that the measured time also roughly quadrupled. This suggests that the SMC penalty is proportional to the amount of code written, so the idea of writing multiple functions before executing any doesn’t seem to help much (unless the total is enough to exceed the L1 instruction cache and the processor is an Intel Core processor)

To make this page a little more interesting than just text, here’s a chart comparing the SMC penalty of the obvious solution with that of the best solution indicated above:

[Chart: single-use JIT performance per processor, obvious solution vs best technique]

Note that there’s significant variation between the processors/systems tested, so measurements probably shouldn’t be compared across processors, but the chart nonetheless shows how much of an improvement the techniques above can make.

Summary

The performance of the example shown at the top of this document can be improved, often by a fair margin, by applying some techniques listed above. Unfortunately, there is no single ‘best solution’, as it varies depending on the underlying micro-architecture.

From the results I have (and speculation on my part for results I don’t have) the best solution is:

  • If CPU is Intel Core (Nehalem or later):
    • Alternate between areas of memory (enough to be larger than L1 cache size), i.e. JIT to location 1, execute location 1, JIT to location 2, execute location 2 etc, and loop around ( jit_16region )
    • If the above is not possible, or the cache hit is unacceptable, clear 1 byte per 64-byte cacheline in the area to be written to, before actually writing instructions ( jit_clr_1byte )
  • Else if CPU is Intel Atom, Intel Core2, AMD Family 15h or AMD K10:
    • Write code to a temporary location, then copy to the destination using non-temporal writes ( jit_memcpy_sse2_nt ). SSE1/2 writes are sufficient, as none of these CPUs have 256-bit load/store ports
  • Else if CPU is AMD Family 17h, 18h or newer AMD:
    • Write code to a temporary location, then copy to the destination ( REP MOVS seems to be optimal ( jit_memcpy_movsb ), but memcpy probably works fine too ( jit_memcpy ))
  • Otherwise (AMD Family 16h and unknowns):
    • Just use the plain approach ( jit_plain ); on Jaguar, none of the techniques tested improved upon it

Probably not emphasised enough, but I do stress that the results here are specific to the test case examined. If you implement a single-use JIT algorithm, results may vary depending on how much code is written and the way it is written (e.g. fewer, larger write operations will perform better than many small ones).

Further Experimentation

Tests on processors not listed here are welcome, as are any other findings, corrections and suggestions.

Compiling/Running the Test Application

The code currently does not implement dynamic runtime dispatch, so should be compiled using GCC/Clang with the -march=native flag on the machine running the code. The following command can be used to compile this test:

cc -g -std=gnu99 -O3 -march=native -o test test.c jump.s

Note that on some Linux distros you may also need to add -lrt to the end.

I’ve noticed significant variability in results when running the test. The code does try to cater for this by running multiple trials and taking the fastest run, but it may be beneficial to set the CPU governor/power profile to Performance, and disable turbo boost, before running the test. Note that I haven’t done this for any of the results shown here.

