Latency implications of virtual memory


This is a short guide describing the latency implications of the virtual memory abstraction. If you are building systems requiring low and predictable latency, such as realtime audio processing, control systems, or high frequency trading (HFT) / algorithmic trading systems, this guide will be useful to you. It is written from the perspective of the Linux kernel running on the AMD64 / x86-64 architecture, but the general concepts apply to most operating systems and CPU architectures.

In summary, to minimize latency introduced by the virtual memory abstraction you should:

  • Minimize page faults by pre-faulting, locking and pre-allocating needed memory. Disable swap.
  • Reduce TLB misses by minimizing your working set memory and utilizing huge pages.
  • Prevent TLB shootdowns by not modifying your program's page tables after startup.
  • Prevent stalls due to page cache writeback by not creating file backed writable memory mappings.
  • Disable Linux transparent huge pages (THP) background defragmentation.
  • Disable Linux kernel samepage merging (KSM).
  • Disable Linux automatic NUMA balancing.

Page faults

When reading or writing file backed memory that is not in the page cache, or anonymous memory that has been swapped out, the kernel must first load the data from the underlying storage device. This is called a major page fault and incurs overhead similar to issuing a read or write system call.

If the page is already in the page cache you will still incur a minor page fault on first access after calling mmap , during which the page table is updated to point to the correct page. For anonymous memory there will also be a minor page fault on first write access, when an anonymous page is allocated, zeroed and the page table updated. In other words, memory mappings are lazily initialized on first use. Note also that access to the page table during a page fault is protected by locks, leading to scalability issues in multi-threaded applications.

To avoid page faults you can pre-fault and disable page cache eviction of the needed memory using the mlock system call or the MAP_LOCKED and MAP_POPULATE flags to mmap . You can also disable swap system-wide to prevent anonymous memory from being swapped to disk.

You can monitor the number of page faults using

ps -eo min_flt,maj_flt,cmd

or

perf stat -e faults,minor-faults,major-faults

TLB misses

The translation lookaside buffer (TLB) is an on-CPU cache that maps virtual to physical addresses. These mappings are maintained per page, typically of size 4 KiB, 2/4 MiB or 1 GiB. Usually there are separate TLBs for data (DTLB) and instructions (ITLB) with a shared second level TLB (STLB). The TLB has a limited number of entries, and if an address is not found in the TLB or STLB, the page table data in the CPU caches or main memory needs to be referenced; this is called a TLB miss. Just as a CPU cache miss is more expensive than a cache hit, a TLB miss is more expensive than a TLB hit.

You can minimize TLB misses by reducing your working set size, making sure to pack your data into as few pages as possible. Additionally you can utilize larger page sizes than the default 4 KiB. These larger pages are called huge pages and allow you to reference more data using fewer pages.

TLB usage can be monitored using:

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses

TLB shootdowns

Most processors do not provide coherence guarantees for TLB mappings. Instead the kernel provides this guarantee using a mechanism called TLB shootdown . It operates by sending inter-processor interrupts (IPIs) that run kernel code to invalidate the stale TLB entries. TLB shootdowns cause each affected core to context switch into the kernel and thus cause latency spikes for the processes running on the affected cores. They will also cause TLB misses when an address with an invalidated page table entry is subsequently accessed.

Any operation that narrows a process's access to memory, like munmap and mprotect , will cause a TLB shootdown. Calls to the C standard library allocator ( malloc , free , etc.) will call madvise(...MADV_FREE) / munmap internally, though not necessarily on each invocation. TLB shootdowns will also occur during page cache writeback.

To avoid TLB shootdowns you can map all needed memory at program startup and avoid calling any functions that modify the page table after that. The mimalloc allocator can be tuned to allocate huge pages at program startup ( MIMALLOC_RESERVE_HUGE_OS_PAGES=N ) and never return memory to the OS ( MIMALLOC_PAGE_RESET=0 ).

You can monitor the number of TLB shootdowns in /proc/interrupts .

Page cache writeback

When a page in the page cache has been modified it is marked as dirty and needs to be eventually written back to disk. This process is called writeback and is triggered automatically on a timer or when specifically requested using the system calls fsync , fdatasync , sync , syncfs , msync and others. If any of the dirty pages are part of a writable memory mapping, the writeback process must first update the page table to mark the page as read-only before writing it to disk. Any subsequent memory write to the page will cause a page fault, letting the kernel update the page cache state to dirty and mark the page writable again. In practice this means that writeback causes TLB shootdowns and that writes to pages that are currently being written to disk must stall until the disk write is complete. This leads to latency spikes for any process that is using file backed writable memory mappings.

To avoid latency spikes due to page cache writeback you must not create any file backed writable memory mappings. Creating anonymous writable memory mappings using mmap(...MAP_ANONYMOUS) or by mapping files on a Linux tmpfs filesystem is fine.

I wrote a small program to demonstrate this effect.

Transparent hugepages

On Linux, transparent huge page (THP) support should cause TLB shootdowns when memory regions are compacted / defragmented, but I have not verified this.

THP can be beneficial for reducing TLB misses, but you should disable khugepaged background defragmentation to avoid any latency spikes due to the defragmentation process:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

There is some ongoing work to support THP for executables: https://lwn.net/Articles/789159/

Kernel samepage merging

Linux kernel samepage merging (KSM) is a feature that can de-duplicate pages with identical data. The merging process will lead to TLB shootdowns and unpredictable memory access latencies.

Make sure kernel samepage merging (KSM) is disabled.

NUMA and page migration

Non-uniform memory access (NUMA) occurs when the memory access time varies with memory location and processor core. You need to take this into account when designing your system.

On Linux you can use cpusets , numactl , set_mempolicy and mbind to control the NUMA node memory placement policy.

Additionally, Linux supports automatic migration of memory between NUMA nodes. The automatic NUMA balancing will cause page faults and TLB shootdowns and should be disabled:

echo 0 > /proc/sys/kernel/numa_balancing
