This is a short guide describing the latency implications of the virtual memory abstraction. If you are building systems requiring low and predictable latency, such as realtime audio processing, control and high frequency trading (HFT) / algorithmic trading systems, this guide will be useful to you. It is written from the perspective of the Linux kernel running on the AMD64 / x86-64 architecture, but the general concepts apply to most operating systems and CPU architectures.
In summary, to minimize latency introduced by the virtual memory abstraction you should:
- Minimize page faults by pre-faulting, locking and pre-allocating needed memory. Disable swap.
- Reduce TLB misses by minimizing your working set memory and utilizing huge pages.
- Prevent TLB shootdowns by not modifying your program's page tables after startup.
- Prevent stalls due to page cache writeback by not creating file backed writable memory mappings.
- Disable Linux transparent huge pages (THP) background defragmentation.
- Disable Linux kernel samepage merging (KSM).
- Disable Linux automatic NUMA balancing.
Page faults
When reading or writing to file backed memory that is not in the page cache, or to anonymous memory that has been swapped out, the kernel must first load the data from the underlying storage device. This is called a major page fault and incurs overhead similar to issuing a read or write system call.
If the page is already in the page cache you will still incur a minor page fault on first access after calling mmap, during which the page table is updated to point to the correct page. For anonymous memory there will also be a minor page fault on first write access, when an anonymous page is allocated, zeroed and the page table updated. Basically, memory mappings are lazily initialized on first use. Note also that access to the page table during a page fault is protected by locks, leading to scalability issues in multi-threaded applications.
To avoid page faults you can pre-fault the needed memory and disable its page cache eviction using the mlock system call or the MAP_LOCKED and MAP_POPULATE flags to mmap. You can also disable swap system wide to prevent anonymous memory from being swapped to disk.
You can monitor the number of page faults using
ps -eo min_flt,maj_flt,cmd
or
perf stat -e faults,minor-faults,major-faults
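The lazy initialization described above can also be observed from inside a process. The following is a minimal Python sketch, not a benchmark: it writes one byte per page of an anonymous mapping and reads the minor fault counter from getrusage before and after. The function names and the 16 MiB size are made up for illustration, and it assumes Linux with Python 3.10 or later for mmap.MAP_POPULATE (on older versions it silently falls back to a lazy mapping).

```python
import mmap
import resource

PAGE_SIZE = mmap.PAGESIZE
LENGTH = 16 * 1024 * 1024  # 16 MiB, an arbitrary size for the demonstration


def minor_faults():
    """Minor page faults taken so far by this process."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def faults_while_touching(extra_flags=0, length=LENGTH):
    """Map anonymous memory, write one byte per page, return the minor faults incurred."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | extra_flags
    buf = mmap.mmap(-1, length, flags=flags)
    try:
        before = minor_faults()
        for offset in range(0, length, PAGE_SIZE):
            buf[offset] = 1  # first write to a page that is not yet mapped faults
        return minor_faults() - before
    finally:
        buf.close()


if __name__ == "__main__":
    # Assumption: MAP_POPULATE is exposed by the mmap module (Python 3.10+, Linux).
    populate = getattr(mmap, "MAP_POPULATE", 0)
    print("lazy mapping:     ", faults_while_touching())
    print("populated mapping:", faults_while_touching(populate))
```

With MAP_POPULATE the faults are taken inside the mmap call instead of on first touch, so the second number should be close to zero. The exact counts depend on the kernel configuration, for example whether transparent huge pages back the mapping.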
TLB misses
The translation lookaside buffer (TLB) is an on-CPU cache that maps virtual to physical addresses. These mappings are maintained for pages, typically of size 4 KiB, 2/4 MiB or 1 GiB. Usually there are separate TLBs for data (DTLB) and instructions (ITLB) with a shared second level TLB (STLB). The TLB has a limited number of entries, and if an address is not found in the TLB or STLB, the page table data in the CPU caches or main memory needs to be referenced. This is called a TLB miss. Just as a CPU cache miss is more expensive than a cache hit, a TLB miss is more expensive than a TLB hit.
You can minimize TLB misses by reducing your working set size, making sure to pack your data into as few pages as possible. Additionally, you can utilize larger page sizes than the default 4 KiB. These larger pages are called huge pages and allow you to reference more data using fewer pages.
TLB usage can be monitored using:
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses
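To make the effect of page size concrete, the arithmetic can be sketched in a few lines of Python. The 1536-entry TLB used below is a hypothetical figure chosen only for illustration (real sizes vary by CPU model), and the helper names are made up.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30
PAGE_SIZES = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": GIB}


def pages_needed(working_set, page_size):
    """Number of pages (and thus TLB entries) needed to cover the working set."""
    return -(-working_set // page_size)  # ceiling division


def tlb_reach(entries, page_size):
    """Bytes of memory addressable without a TLB miss, given the entry count."""
    return entries * page_size


if __name__ == "__main__":
    working_set = 1 * GIB
    entries = 1536  # hypothetical second level TLB size, not a measured value
    for name, size in PAGE_SIZES.items():
        pages = pages_needed(working_set, size)
        reach_mib = tlb_reach(entries, size) // MIB
        print(f"{name}: {pages} pages for 1 GiB, TLB reach {reach_mib} MiB")
```

A 1 GiB working set needs 262,144 pages of 4 KiB but only 512 pages of 2 MiB, which is why huge pages let the same number of TLB entries cover far more memory.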
TLB shootdowns
Most processors do not provide coherence guarantees for TLB mappings. Instead the kernel provides this guarantee using a mechanism called TLB shootdown. It operates by sending inter-processor interrupts (IPIs) that run kernel code to invalidate the stale TLB entries. TLB shootdowns cause each affected core to context switch into the kernel and thus cause latency spikes for the processes running on the affected cores. They will also cause TLB misses when an address with an invalidated page table entry is subsequently accessed.
Any operation that narrows a process's access to memory, like munmap and mprotect, will cause a TLB shootdown. Calls to the C standard library allocator (malloc, free, etc.) will call madvise(...MADV_FREE) / munmap internally, but not necessarily on each invocation. TLB shootdowns will also occur during page cache writeback.
To avoid TLB shootdowns you can map all needed memory at program startup and avoid calling any functions that modify the page table after that. The mimalloc allocator can be tuned to allocate huge pages at program startup (MIMALLOC_RESERVE_HUGE_OS_PAGES=N) and never return memory to the OS (MIMALLOC_PAGE_RESET=0).
You can monitor the number of TLB shootdowns in /proc/interrupts.
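As a rough sketch of how that counter could be read programmatically, the Python below extracts the per-CPU counts from the row labelled "TLB:" in /proc/interrupts. It assumes the x86 layout, where that row ends with the description "TLB shootdowns". The function names are made up for illustration.

```python
def tlb_shootdowns_per_cpu(interrupts_text):
    """Return the per-CPU counts from the 'TLB:' row, or [] if there is none."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TLB:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description text
                counts.append(int(field))
            return counts
    return []


def read_tlb_shootdowns(path="/proc/interrupts"):
    """Read the file and parse it; the default path assumes Linux."""
    with open(path) as f:
        return tlb_shootdowns_per_cpu(f.read())


if __name__ == "__main__":
    try:
        print(read_tlb_shootdowns())
    except OSError as exc:
        print("could not read /proc/interrupts:", exc)
```

Sampling the counts twice and subtracting gives the number of shootdowns each core received during the interval, which is usually more informative than the totals since boot.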
Page cache writeback
When a page in the page cache has been modified it is marked as dirty and needs to be eventually written back to disk. This process is called writeback and is triggered automatically on a timer or when specifically requested using the system calls fsync, fdatasync, sync, syncfs, msync, and others. If any of the dirty pages are part of a writable memory mapping, the writeback process must first update the page table to mark the page as read-only before writing it to disk. Any subsequent memory write to the page will cause a page fault, letting the kernel update the page cache state to dirty and mark the page writable again. In practice this means that writeback causes TLB shootdowns and that writes to pages that are currently being written to disk must stall until the disk write is complete. This leads to latency spikes for any process that is using file backed writable memory mappings.
To avoid latency spikes due to page cache writeback you must not create any file backed writable memory mappings. Creating writable memory mappings that are anonymous, using mmap(...MAP_ANONYMOUS), or that map files on a Linux tmpfs filesystem is fine.
I wrote a small program to demonstrate this effect.
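The two safe kinds of mapping mentioned above could be created along the following lines. This is an illustrative Python sketch, not the demonstration program referred to in the text. The helper names are made up, and the default directory /dev/shm is only assumed to be a tmpfs mount, which is common on Linux but not guaranteed.

```python
import mmap
import tempfile


def map_anonymous(size):
    """Writable shared mapping with no backing file, so nothing is written back to disk."""
    return mmap.mmap(-1, size, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)


def map_tmpfs_file(size, directory="/dev/shm"):
    """Writable mapping of an unnamed temporary file created in `directory`.

    Assumption: `directory` is on tmpfs. The function does not verify this.
    """
    with tempfile.TemporaryFile(dir=directory) as f:
        f.truncate(size)  # grow the empty file to the mapping size
        # mmap duplicates the descriptor, so the mapping outlives the file object.
        return mmap.mmap(f.fileno(), size)


if __name__ == "__main__":
    buf = map_anonymous(mmap.PAGESIZE)
    buf[:5] = b"hello"
    print(bytes(buf[:5]))
    buf.close()
```

The same map_tmpfs_file call pointed at a directory on a disk backed filesystem would create exactly the kind of mapping the text warns against, so the directory argument is the part that matters.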
Transparent hugepages
On Linux, transparent huge page (THP) support is expected to cause TLB shootdowns when memory regions are compacted / defragmented, but I have not verified that.
THP can be beneficial for reducing TLB misses, but you should disable khugepaged background defragmentation to avoid any latency spikes due to the defragmentation process:
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
There is some ongoing work to support THP for executables: https://lwn.net/Articles/789159/
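To check the current THP configuration, the sysfs files can be read directly. The sketch below is a minimal Python helper with made-up names. It relies on the kernel's convention of marking the active choice with square brackets (for example "always [madvise] never"), while khugepaged/defrag holds a plain 0 or 1.

```python
import re
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def selected_option(text):
    """Return the bracketed choice, or the stripped text if nothing is bracketed."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else text.strip()


def thp_settings(base=THP_DIR):
    """Collect the THP settings; a value of None means the file could not be read."""
    settings = {}
    for name in ("enabled", "defrag", "khugepaged/defrag"):
        try:
            settings[name] = selected_option((Path(base) / name).read_text())
        except OSError:
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in thp_settings().items():
        print(f"{name}: {value}")
```

A khugepaged/defrag value of "0" corresponds to the echo command shown above having taken effect.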
Kernel samepage merging
Linux kernel samepage merging (KSM) is a feature that can de-duplicate pages with identical data. The merging process will lead to TLB shootdowns and unpredictable memory access latencies.
Make sure kernel samepage merging (KSM) is disabled.
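KSM is controlled through /sys/kernel/mm/ksm/run, where 0 stops the ksmd daemon, 1 runs it, and 2 stops it and unmerges all currently merged pages. A small Python check could look like the sketch below. The helper names and the wording of the state descriptions are made up for illustration.

```python
from pathlib import Path

KSM_RUN = Path("/sys/kernel/mm/ksm/run")

KSM_STATES = {
    "0": "stopped (already merged pages stay merged)",
    "1": "running",
    "2": "stopped (merged pages are unmerged)",
}


def describe_ksm_run(value):
    """Translate the contents of the ksm/run file into a description."""
    value = value.strip()
    return KSM_STATES.get(value, f"unknown value {value!r}")


def ksm_state(path=KSM_RUN):
    """Describe the KSM state, tolerating kernels built without KSM."""
    try:
        return describe_ksm_run(Path(path).read_text())
    except OSError:
        return "unavailable (KSM not built in, or file not readable)"


if __name__ == "__main__":
    print("KSM:", ksm_state())
```

Note that KSM only scans memory that an application has opted in to with madvise(...MADV_MERGEABLE), so it mostly matters on hosts running virtual machines or other software that sets that flag.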
NUMA and Page migration
Non-uniform memory access (NUMA) occurs when the memory access time varies with memory location and processor core. You need to take this into account when designing your system.
On Linux you can use cpusets, numactl, set_mempolicy and mbind to control the NUMA node memory placement policy.
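To verify where a process's pages actually ended up, /proc/<pid>/numa_maps lists each mapping with fields of the form N<node>=<pages>. The following Python sketch sums those fields per node. The function names are made up, and the counts are in units of each mapping's own page size (the kernelpagesize_kB field), which this simple version does not convert to bytes.

```python
import re
from collections import Counter

NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings, keyed by node number."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for field in line.split():
            match = NODE_FIELD.match(field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def read_pages_per_node(pid="self"):
    """Parse /proc/<pid>/numa_maps; the file only exists on NUMA-enabled kernels."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return pages_per_node(f.read())


if __name__ == "__main__":
    try:
        print(read_pages_per_node())
    except OSError as exc:
        print("numa_maps not available:", exc)
```

If a latency critical thread is pinned to one node but a large share of its pages are reported on another, the placement policy is probably not doing what was intended.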
Additionally, Linux supports automatic migration of memory between NUMA nodes. The automatic NUMA balancing will cause page faults and TLB shootdowns and should be disabled:
echo 0 > /proc/sys/kernel/numa_balancing