Latency implications of virtual memory

Category: IT Technology · Published: 4 years ago

This is a short guide describing the latency implications of the virtual memory abstraction. If you are building systems that require low and predictable latency, such as realtime audio processing, control, and high frequency trading (HFT) / algorithmic trading systems, this guide will be useful to you. It is written from the perspective of the Linux kernel running on the AMD64 / x86-64 architecture, but the general concepts apply to most operating systems and CPU architectures.

In summary, to minimize the latency introduced by the virtual memory abstraction you should:

  • Minimize page faults by pre-faulting, locking and pre-allocating needed memory. Disable swap.
  • Reduce TLB misses by minimizing your working set memory and utilizing huge pages.
  • Prevent TLB shootdowns by not modifying your program's page tables after startup.
  • Prevent stalls due to page cache writeback by not creating file backed writable memory mappings.
  • Disable Linux transparent huge pages (THP) background defragmentation.
  • Disable Linux kernel samepage merging (KSM).
  • Disable Linux automatic NUMA balancing.

Page faults

When reading or writing file backed memory that is not in the page cache, or anonymous memory that has been swapped out, the kernel must first load the data from the underlying storage device. This is called a major page fault and incurs overhead similar to issuing a read or write system call.

If the page is already in the page cache you will still incur a minor page fault on first access after calling mmap, during which the page table is updated to point to the correct page. For anonymous memory there will also be a minor page fault on first write access, when an anonymous page is allocated and zeroed and the page table is updated. In other words, memory mappings are lazily initialized on first use. Note also that access to the page table during a page fault is protected by locks, which leads to scalability issues in multi-threaded applications.

To avoid page faults you can pre-fault the needed memory and prevent it from being evicted from the page cache, using the mlock system call or the MAP_LOCKED and MAP_POPULATE flags to mmap. You can also disable swap system wide to prevent anonymous memory from being swapped to disk.
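The effect of pre-faulting can be observed from inside a process. The following is a minimal sketch, not a definitive implementation: it maps anonymous memory with and without MAP_POPULATE and counts the minor faults taken while touching every page. Python is used here only to keep the example short and runnable; a latency-critical program would make the equivalent mmap call from its own language. The helper names are hypothetical, mmap.MAP_POPULATE is assumed to be available (Python 3.10+ on Linux) and is silently skipped otherwise, and locking with mlock is not shown.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE


def minor_faults() -> int:
    """Minor page faults taken so far by this process."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def faults_while_touching(size: int, populate: bool) -> int:
    """Map `size` bytes of anonymous memory, write one byte per page,
    and return how many minor faults the writes caused."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    if populate:
        # Assumption: MAP_POPULATE is exposed by this Python build.
        flags |= getattr(mmap, "MAP_POPULATE", 0)
    buf = mmap.mmap(-1, size, flags=flags)
    try:
        before = minor_faults()
        for offset in range(0, size, PAGE):
            buf[offset] = 1  # first write to a lazily mapped page faults
        return minor_faults() - before
    finally:
        buf.close()


if __name__ == "__main__":
    size = 64 * 1024 * 1024
    print("lazy mapping:     ", faults_while_touching(size, populate=False))
    print("populated mapping:", faults_while_touching(size, populate=True))
```

On a typical Linux system the populated mapping should report far fewer faults than the lazy one, although the exact numbers depend on the kernel, on transparent huge pages, and on the interpreter's own allocations.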

You can monitor the number of page faults using:

ps -eo min_flt,maj_flt,cmd

or

perf stat -e faults,minor-faults,major-faults
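The same counters that ps reports can also be read programmatically from /proc/<pid>/stat, which is convenient for a watchdog that alerts when a latency-critical process starts faulting. This is a rough sketch with hypothetical function names; it relies on the field layout documented in proc(5), where minflt is field 10 and majflt is field 12.

```python
from typing import Tuple


def parse_fault_counts(stat_line: str) -> Tuple[int, int]:
    """Return (minor_faults, major_faults) from a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')',
    # so only split what follows the last closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 10 is rest[7], field 12 is rest[9].
    return int(rest[7]), int(rest[9])


def fault_counts(pid="self") -> Tuple[int, int]:
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counts(f.read())


if __name__ == "__main__":
    minor, major = fault_counts()
    print(f"minor={minor} major={major}")
```

Sampling fault_counts twice and comparing the results gives the number of faults taken in the interval; for a process that has pre-faulted and locked its memory, both deltas should stay at or near zero in steady state.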

TLB misses

The translation lookaside buffer (TLB) is an on-CPU cache that maps virtual to physical addresses. These mappings are maintained per page, typically of size 4 KiB, 2/4 MiB or 1 GiB. Usually there are separate TLBs for data (DTLB) and instructions (ITLB), with a shared second level TLB (STLB). The TLB has a limited number of entries, and if an address is not found in the TLB or STLB, the page table data in the CPU caches or main memory needs to be referenced. This is called a TLB miss. Just as a CPU cache miss is more expensive than a cache hit, a TLB miss is more expensive than a TLB hit.

You can minimize TLB misses by reducing your working set size, making sure to pack your data into as few pages as possible. Additionally you can utilize larger page sizes than the default 4 KiB. These larger pages are called huge pages and allow you to reference more data using fewer pages.
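A quick calculation shows why page size matters. The amount of memory a TLB can cover without missing (its reach) is the number of entries multiplied by the page size. The sketch below uses a hypothetical second level TLB with 1536 entries; real entry counts vary by CPU model and are often different for each page size, so treat the figure as an assumption and substitute the numbers for your hardware.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory addressable without a TLB miss."""
    return entries * page_size


def pages_needed(working_set: int, page_size: int) -> int:
    """Pages (and thus TLB entries) required to map a working set."""
    return -(-working_set // page_size)  # ceiling division


if __name__ == "__main__":
    entries = 1536  # hypothetical STLB size, not taken from any datasheet
    for name, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
        reach_mib = tlb_reach(entries, size) / MIB
        pages = pages_needed(1 * GIB, size)
        print(f"{name} pages: reach {reach_mib:.0f} MiB, "
              f"{pages} pages for a 1 GiB working set")
```

With these assumed numbers, 4 KiB pages give a reach of 6 MiB while 2 MiB pages give 3 GiB, and a 1 GiB working set needs 262144 small pages but only 512 huge pages.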

TLB usage can be monitored using:

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses

TLB shootdowns

Most processors do not provide coherence guarantees for TLB mappings. Instead the kernel provides this guarantee using a mechanism called TLB shootdown. It operates by sending inter-processor interrupts (IPIs) that run kernel code to invalidate the stale TLB entries. A TLB shootdown forces each affected core to switch into the kernel and thus causes a latency spike for the process running on that core. It also causes TLB misses when an address with an invalidated page table entry is subsequently accessed.

Any operation that narrows a process's access to memory, such as munmap and mprotect, will cause a TLB shootdown. Calls to the C standard library allocator (malloc, free, etc.) will call madvise(...MADV_FREE) / munmap internally, but not necessarily on each invocation. TLB shootdowns will also occur during page cache writeback.

To avoid TLB shootdowns you can map all needed memory at program startup and avoid calling any functions that modify the page table after that. The mimalloc allocator can be tuned to allocate huge pages at program startup (MIMALLOC_RESERVE_HUGE_OS_PAGES=N) and never return memory to the OS (MIMALLOC_PAGE_RESET=0).

You can monitor the number of TLB shootdowns in /proc/interrupts.
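On x86-64 the relevant row in /proc/interrupts is labelled TLB and holds one counter per CPU. The sketch below extracts that row and reports how many shootdowns each CPU received over an interval. The function names are hypothetical, and the row format is assumed to match what current x86-64 kernels print; other architectures may not have this row at all, in which case an empty list is returned.

```python
import time
from typing import List


def parse_tlb_shootdowns(interrupts_text: str) -> List[int]:
    """Per-CPU TLB shootdown counts from the text of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "TLB":
            # The counters are followed by the description "TLB shootdowns".
            return [int(tok) for tok in rest.split() if tok.isdigit()]
    return []


def read_tlb_shootdowns(path: str = "/proc/interrupts") -> List[int]:
    with open(path) as f:
        return parse_tlb_shootdowns(f.read())


def shootdowns_during(seconds: float) -> List[int]:
    """Per-CPU increase in TLB shootdowns over the given interval."""
    before = read_tlb_shootdowns()
    time.sleep(seconds)
    after = read_tlb_shootdowns()
    return [a - b for a, b in zip(after, before)]


if __name__ == "__main__":
    print(shootdowns_during(1.0))
```

If the latency-critical threads are pinned to dedicated cores, the counters for those cores are the ones to watch; they should stay flat once the program has finished its startup phase.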

Page cache writeback

When a page in the page cache has been modified it is marked as dirty and eventually needs to be written back to disk. This process is called writeback and is triggered automatically on a timer or when specifically requested using the system calls fsync, fdatasync, sync, syncfs, msync, and others. If any of the dirty pages are part of a writable memory mapping, the writeback process must first update the page table to mark the page as read-only before writing it to disk. Any subsequent memory write to the page will cause a page fault, letting the kernel update the page cache state to dirty and mark the page writable again. In practice this means that writeback causes TLB shootdowns and that writes to pages that are currently being written to disk must stall until the disk write is complete. This leads to latency spikes for any process that is using file backed writable memory mappings.

To avoid latency spikes due to page cache writeback you must not create any file backed writable memory mappings. Creating anonymous writable memory mappings using mmap(...MAP_ANONYMOUS) or by mapping files on a Linux tmpfs filesystem is fine.
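One way to check a running process for such mappings is to scan /proc/<pid>/maps for entries that are both writable and shared and that have a backing inode. This is a rough heuristic sketch with a hypothetical function name. It assumes that shared mappings are the ones whose dirty pages go through writeback, and it cannot tell tmpfs or shared memory files apart from files on a disk filesystem, so every hit still needs to be checked by hand.

```python
from typing import List, Tuple


def writable_shared_file_mappings(maps_text: str) -> List[Tuple[str, str]]:
    """Return (address_range, path) for writable MAP_SHARED mappings
    that have a backing inode, given the text of /proc/<pid>/maps."""
    hits = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # no pathname column: plain anonymous memory
        addr, perms, _offset, _dev, inode, path = parts
        if "w" in perms and perms.endswith("s") and inode != "0":
            hits.append((addr, path.strip()))
    return hits


if __name__ == "__main__":
    with open("/proc/self/maps") as f:
        for addr, path in writable_shared_file_mappings(f.read()):
            print(addr, path)
```

Paths under a tmpfs mount such as /dev/shm are acceptable according to the guidance above; anything that resolves to a disk-backed filesystem is a candidate source of writeback stalls.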

I wrote a small program to demonstrate this effect.

Transparent hugepages

On Linux, transparent huge page (THP) support is expected to cause TLB shootdowns when memory regions are compacted / defragmented, but I have not verified this.

THP can be beneficial for reducing TLB misses, but you should disable khugepaged background defragmentation to avoid any latency spikes due to the defragmentation process:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
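Before and after changing these settings it can be useful to read back the current THP configuration. The sysfs files enabled and defrag list all possible values and mark the active one with square brackets (for example "always [madvise] never"), while khugepaged/defrag holds a plain 0 or 1. The sketch below reads all three; the function names are hypothetical and missing files are reported as None, since THP may not be compiled into every kernel.

```python
import re
from pathlib import Path
from typing import Dict, Optional

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def selected_option(text: str) -> Optional[str]:
    """Return the bracketed choice from a sysfs multi-choice string."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def thp_status(base: Path = THP_DIR) -> Dict[str, Optional[str]]:
    status = {}
    for name in ("enabled", "defrag", "khugepaged/defrag"):
        try:
            text = (base / name).read_text().strip()
        except OSError:
            status[name] = None  # file missing or unreadable
            continue
        # Plain numeric files have no brackets, so fall back to raw text.
        status[name] = selected_option(text) or text
    return status


if __name__ == "__main__":
    for name, value in thp_status().items():
        print(f"{name}: {value}")
```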

There is some ongoing work to support THP for executables: https://lwn.net/Articles/789159/

Kernel samepage merging

Linux kernel samepage merging (KSM) is a feature that can de-duplicate pages with identical data. The merging process will lead to TLB shootdowns and unpredictable memory access latencies.

Make sure kernel samepage merging (KSM) is disabled.

NUMA and Page migration

Non-uniform memory access (NUMA) occurs when the memory access time varies with memory location and processor core. You need to take this into account when designing your system.

On Linux you can use cpusets, numactl, set_mempolicy and mbind to control the NUMA node memory placement policy.

Additionally, Linux supports automatic migration of memory between NUMA nodes. This automatic NUMA balancing causes page faults and TLB shootdowns and should be disabled:

echo 0 > /proc/sys/kernel/numa_balancing
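The kernel settings this guide recommends turning off can be checked together at startup, so that a misconfigured host is noticed before it causes latency spikes. The following is a minimal sketch with hypothetical names. The khugepaged and numa_balancing paths come from the commands shown in this guide; /sys/kernel/mm/ksm/run is the standard KSM control file but is not mentioned in the original text. Non-zero values are reported as-is rather than interpreted, because some of these files accept values other than 0 and 1.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

SETTINGS = (
    "/sys/kernel/mm/transparent_hugepage/khugepaged/defrag",
    "/sys/kernel/mm/ksm/run",  # assumption: standard KSM control file
    "/proc/sys/kernel/numa_balancing",
)


def read_setting(path: str) -> Optional[str]:
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def nonzero_settings(
    read: Callable[[str], Optional[str]] = read_setting,
) -> Dict[str, str]:
    """Map each readable setting that is not "0" to its current value."""
    flagged = {}
    for path in SETTINGS:
        value = read(path)
        if value is not None and value != "0":
            flagged[path] = value
    return flagged


if __name__ == "__main__":
    for path, value in nonzero_settings().items():
        print(f"{path} = {value} (expected 0)")
```

Settings that do not exist on the running kernel are skipped, since a feature that is not compiled in cannot cause the problems described here.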
