Latency implications of virtual memory


This is a short guide describing the latency implications of the virtual memory abstraction. If you are building systems that require low and predictable latency, such as realtime audio processing, control systems, or high frequency trading (HFT) / algorithmic trading systems, this guide will be useful to you. It is written from the perspective of the Linux kernel running on the AMD64 / x86-64 architecture, but the general concepts apply to most operating systems and CPU architectures.

In summary, to minimize the latency introduced by the virtual memory abstraction you should:

  • Minimize page faults by pre-faulting, locking and pre-allocating needed memory. Disable swap.
  • Reduce TLB misses by minimizing your working set memory and utilizing huge pages.
  • Prevent TLB shootdowns by not modifying your program's page tables after startup.
  • Prevent stalls due to page cache writeback by not creating file backed writable memory mappings.
  • Disable Linux transparent huge pages (THP) background defragmentation.
  • Disable Linux kernel samepage merging (KSM).
  • Disable Linux automatic NUMA balancing.

Page faults

When reading or writing file backed memory that is not in the page cache, or anonymous memory that has been swapped out, the kernel must first load the data from the underlying storage device. This is called a major page fault and incurs a similar overhead to issuing a read or write system call.

If the page is already in the page cache you will still incur a minor page fault on first access after calling mmap, during which the page table is updated to point to the correct page. For anonymous memory there will also be a minor page fault on first write access, when an anonymous page is allocated, zeroed, and the page table updated. Basically, memory mappings are lazily initialized on first use. Note also that access to the page table during a page fault is protected by locks, leading to scalability issues in multi-threaded applications.

To avoid page faults you can pre-fault and disable page cache eviction of the needed memory using the mlock system call or the MAP_LOCKED and MAP_POPULATE flags to mmap. You can also disable swap system-wide to prevent anonymous memory from being swapped to disk.

You can monitor the number of page faults using

ps -eo min_flt,maj_flt,cmd

or

perf stat -e faults,minor-faults,major-faults

TLB misses

The translation lookaside buffer (TLB) is an on-CPU cache that maps virtual to physical addresses. These mappings are maintained at page granularity, typically 4 KiB, 2/4 MiB or 1 GiB. Usually there are separate TLBs for data (DTLB) and instructions (ITLB), with a shared second level TLB (STLB). The TLB has a limited number of entries, and if an address is not found in the TLB or STLB, the page table data in the CPU caches or main memory needs to be referenced; this is called a TLB miss. Just as a CPU cache miss is more expensive than a cache hit, a TLB miss is more expensive than a TLB hit.

You can minimize TLB misses by reducing your working set size, making sure to pack your data into as few pages as possible. Additionally you can utilize larger page sizes than the default 4 KiB. These larger pages are called huge pages and allow you to reference more data using fewer pages.

TLB usage can be monitored using:

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses

TLB shootdowns

Most processors do not provide coherence guarantees for TLB mappings. Instead the kernel provides this guarantee using a mechanism called TLB shootdown. It operates by sending inter-processor interrupts (IPIs) that run kernel code to invalidate the stale TLB entries. TLB shootdowns cause each affected core to context switch into the kernel and thus cause latency spikes for the processes running on the affected cores. They will also cause TLB misses when an address with an invalidated page table entry is subsequently accessed.

Any operation that narrows a process's access to memory, like munmap and mprotect, will cause a TLB shootdown. Calls to the C standard library allocator (malloc, free, etc.) will call madvise(...MADV_FREE) / munmap internally, but not necessarily on each invocation. TLB shootdowns will also occur during page cache writeback.

To avoid TLB shootdowns you can map all needed memory at program startup and avoid calling any functions that modify the page table after that. The mimalloc allocator can be tuned to allocate huge pages at program startup (MIMALLOC_RESERVE_HUGE_OS_PAGES=N) and never return memory to the OS (MIMALLOC_PAGE_RESET=0).

You can monitor the number of TLB shootdowns in /proc/interrupts:

grep TLB /proc/interrupts

Page cache writeback

When a page in the page cache has been modified it is marked as dirty and eventually needs to be written back to disk. This process is called writeback and is triggered automatically on a timer or when specifically requested using the system calls fsync, fdatasync, sync, syncfs, msync, and others. If any of the dirty pages are part of a writable memory mapping, the writeback process must first update the page table to mark the page as read-only before writing it to disk. Any subsequent memory write to the page will cause a page fault, letting the kernel update the page cache state to dirty and mark the page writable again. In practice this means that writeback causes TLB shootdowns and that writes to pages that are currently being written to disk must stall until the disk write is complete. This leads to latency spikes for any process that is using file backed writable memory mappings.

To avoid latency spikes due to page cache writeback you must not create any file backed writable memory mappings. Creating anonymous writable memory mappings using mmap(...MAP_ANONYMOUS) or by mapping files on a Linux tmpfs filesystem is fine.

I wrote a small program to demonstrate this effect.

Transparent hugepages

On Linux, transparent huge page (THP) support should cause TLB shootdowns when memory regions are compacted / defragmented, but I have not verified that.

THP can be beneficial for reducing TLB misses, but you should disable khugepaged background defragmentation to avoid any latency spikes due to the defragmentation process:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

There is some ongoing work to support THP for executables: https://lwn.net/Articles/789159/

Kernel samepage merging

Linux kernel samepage merging (KSM) is a feature that can de-duplicate pages with identical data. The merging process will lead to TLB shootdowns and unpredictable memory access latencies.

Make sure kernel samepage merging (KSM) is disabled.

NUMA and Page migration

Non-uniform memory access (NUMA) occurs when the memory access time varies with memory location and processor core. You need to take this into account when designing your system.

On Linux you can use cpusets , numactl , set_mempolicy and mbind to control the NUMA node memory placement policy.

Additionally Linux supports automatic migration of memory between NUMA nodes . The automatic NUMA balancing will cause page faults and TLB shootdowns and should be disabled:

echo 0 > /proc/sys/kernel/numa_balancing
