Unaligned memory access

11 Jan 2020

A core feature of capnproto-rust is its ability to read messages directly from memory without copying the data into auxiliary structures. Unfortunately, this functionality is a bit tricky to use correctly, as can be seen in its primary interface, the read_message_from_words() function, whose input is of type &[Word]. In the common case where you want to read from a &[u8], you must first call the unsafe function bytes_to_words() in order to get a &[Word]. It is only safe to call this function if you know that your data is 8-byte aligned or if you know that your code will only run on processors that permit unaligned memory access. (EDIT: ralfj informs me that misaligned loads are never okay.) The former condition can be difficult to meet, especially if your memory comes from an external library like sqlite or zmq where no alignment guarantees are given, and the latter condition feels like an unfair burden, both in terms of demanding that you understand a rather subtle concept, and in terms of limiting where your software can run. So it’s easy to understand why someone might shy away from calling bytes_to_words() and, in turn, read_message_from_words().
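To make the alignment condition concrete, here is a minimal sketch of checking whether a byte slice starts on an 8-byte boundary and, if it does not, copying it into aligned storage. The names `AlignedWord`, `is_8_byte_aligned`, and `copy_to_aligned_words` are hypothetical and chosen for this example; they are not part of capnproto-rust, and `AlignedWord` is only a stand-in for an 8-byte-aligned word type.

```rust
/// Stand-in for an 8-byte-aligned word type (an assumption for this sketch,
/// not the capnproto-rust `Word` type).
#[repr(C, align(8))]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct AlignedWord(pub [u8; 8]);

/// Hypothetical helper: true if `bytes` starts on an 8-byte boundary.
pub fn is_8_byte_aligned(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % 8 == 0
}

/// Hypothetical fallback: copy every complete 8-byte chunk of `bytes` into
/// aligned storage. Any trailing bytes (fewer than 8) are dropped.
pub fn copy_to_aligned_words(bytes: &[u8]) -> Vec<AlignedWord> {
    bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut word = [0u8; 8];
            word.copy_from_slice(chunk);
            AlignedWord(word)
        })
        .collect()
}
```

The copy gives up the zero-copy benefit, which is exactly the trade-off that makes the alignment requirement awkward for callers.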

Can we do better? Ideally, capnproto-rust would safely operate directly on input of type &[u8]. We can in fact adapt the code to do that, but it comes at a cost: processors that don’t natively support unaligned access will need to do some more work every time that capnproto-rust loads or stores a multi-byte value. To get some idea of what that extra work looks like, let’s examine the assembly code emitted by rustc! (A better way to quantify the cost would be to perform controlled experiments on actual hardware, but that’s a more involved project than I’d like to tackle right now.)

Below is some code representing a bare-bones simplification of the two approaches to memory access. (The #![no_std] and #[no_mangle] attributes are there to simplify the assembly code.)

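#![no_std]

/// Loads a little-endian u64 by reinterpreting the byte pointer as a u64 pointer.
/// Safety: `x` must be 8-byte aligned; otherwise the dereference is undefined behavior.
#[no_mangle]
pub unsafe fn direct_load(x: &[u8; 8]) -> u64 {
    (*(x.as_ptr() as *const u64)).to_le()
}

/// Loads a little-endian u64 from the bytes themselves; no alignment requirement.
#[no_mangle]
pub fn indirect_load(x: &[u8; 8]) -> u64 {
    u64::from_le_bytes(*x)
}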

The direct_load() function represents the current state of affairs in capnproto-rust. It loads a u64 by casting a pointer of type *const u8 to type *const u64 and then dereferencing that pointer. This is only safe if the input is 8-byte aligned or if the processor can handle unaligned access. (EDIT: again, see ralfj’s Reddit comment.)
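For comparison, Rust's core library offers `core::ptr::read_unaligned`, which keeps the pointer-cast style but makes no alignment assumption. The sketch below is an illustration only; the function name `unaligned_load` is made up for this example and is not something capnproto-rust defines.

```rust
use core::ptr;

/// Illustrative variant of `direct_load()` that tolerates any alignment.
/// (Hypothetical name; not a capnproto-rust function.)
pub fn unaligned_load(x: &[u8; 8]) -> u64 {
    // SAFETY: `x` points to 8 readable, initialized bytes, and
    // `read_unaligned` does not require the pointer to be aligned.
    let raw = unsafe { ptr::read_unaligned(x.as_ptr() as *const u64) };
    // Interpret the bytes as little-endian (a byte swap on big-endian targets).
    u64::from_le(raw)
}
```

This would be expected to compile to much the same code as a byte-array load, though that has not been checked against the targets listed in this post.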

The indirect_load() function represents the safer alternative. We expect this to sometimes require more work than direct_load(), but it has the advantage of being easier to use and understand.
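In practice, a message reader needs to load fields at arbitrary byte offsets, and a builder needs the matching store. A rough sketch of both in the indirect_load() style follows. `read_u64_le` and `write_u64_le` are hypothetical names chosen for this example, not capnproto-rust functions.

```rust
/// Reads a little-endian u64 starting at `offset`.
/// Returns None if the 8 bytes would run past the end of `buf`.
pub fn read_u64_le(buf: &[u8], offset: usize) -> Option<u64> {
    let end = offset.checked_add(8)?;
    let bytes = buf.get(offset..end)?;
    let mut word = [0u8; 8];
    word.copy_from_slice(bytes);
    Some(u64::from_le_bytes(word))
}

/// Writes `value` as little-endian bytes starting at `offset`.
/// Returns None (and writes nothing) if the 8 bytes would not fit.
pub fn write_u64_le(buf: &mut [u8], offset: usize, value: u64) -> Option<()> {
    let end = offset.checked_add(8)?;
    let dest = buf.get_mut(offset..end)?;
    dest.copy_from_slice(&value.to_le_bytes());
    Some(())
}
```

Neither function contains `unsafe`, and neither cares whether `offset` is a multiple of 8, which is the usability argument for this approach.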

To compare the assembly code generated by these functions, I installed a variety of rustc targets using rustup:

rustup target add $TARGET

and then for each target compiled the code with:

rustc -O --crate-type=lib test.rs --target=$TARGET --emit=asm

The results, edited to only include the relevant bits of code, are shown below.

x86_64-unknown-linux-gnu

direct_load:
	movq	(%rdi), %rax
	retq

indirect_load:
	movq	(%rdi), %rax
	retq

i686-unknown-linux-gnu

direct_load:
	movl	4(%esp), %ecx
	movl	(%ecx), %eax
	movl	4(%ecx), %edx
	retl

indirect_load:
	movl	4(%esp), %ecx
	movl	(%ecx), %eax
	movl	4(%ecx), %edx
	retl

aarch64-unknown-linux-gnu

direct_load:
	ldr	x0, [x0]
	ret

indirect_load:
	ldr	x0, [x0]
	ret

wasm32-wasi

direct_load:
	local.get	0
	i64.load	0

indirect_load:
	local.get	0
	i64.load	0:p2align=0

armv7-unknown-linux-gnueabi

direct_load:
	ldrd	r0, r1, [r0]
	bx	lr

indirect_load:
	ldr	r2, [r0]
	ldr	r1, [r0, #4]
	mov	r0, r2
	bx	lr

powerpc-unknown-linux-gnu

direct_load:
	li 4, 4
	lwbrx 5, 3, 4
	lwbrx 4, 0, 3
	mr 3, 5
	blr

indirect_load:
	li 4, 4
	lwbrx 5, 3, 4
	lwbrx 4, 0, 3
	mr 3, 5
	blr

mips-unknown-linux-gnu

direct_load:
	lw	$1, 4($4)
	wsbh	$1, $1
	rotr	$2, $1, 16
	lw	$1, 0($4)
	wsbh	$1, $1
	jr	$ra
	rotr	$3, $1, 16

indirect_load:
	lwl	$1, 4($4)
	lwr	$1, 7($4)
	wsbh	$1, $1
	rotr	$2, $1, 16
	lwl	$1, 0($4)
	lwr	$1, 3($4)
	wsbh	$1, $1
	jr	$ra
	rotr	$3, $1, 16

riscv32i-unknown-none-elf

direct_load:
	addi	sp, sp, -16
	sw	ra, 12(sp)
	sw	s0, 8(sp)
	addi	s0, sp, 16
	lw	a2, 0(a0)
	lw	a1, 4(a0)
	mv	a0, a2
	lw	s0, 8(sp)
	lw	ra, 12(sp)
	addi	sp, sp, 16
	ret

indirect_load:
	addi	sp, sp, -16
	sw	ra, 12(sp)
	sw	s0, 8(sp)
	addi	s0, sp, 16
	lbu	a1, 1(a0)
	slli	a1, a1, 8
	lbu	a2, 0(a0)
	or	a1, a1, a2
	lbu	a2, 3(a0)
	slli	a2, a2, 8
	lbu	a3, 2(a0)
	or	a2, a2, a3
	slli	a2, a2, 16
	or	a2, a2, a1
	lbu	a1, 5(a0)
	slli	a1, a1, 8
	lbu	a3, 4(a0)
	or	a1, a1, a3
	lbu	a3, 6(a0)
	lbu	a0, 7(a0)
	slli	a0, a0, 8
	or	a0, a0, a3
	slli	a0, a0, 16
	or	a1, a0, a1
	mv	a0, a2
	lw	s0, 8(sp)
	lw	ra, 12(sp)
	addi	sp, sp, 16
	ret
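The long riscv32i sequence for indirect_load() is easier to follow when written back out as Rust: each byte is loaded individually (`lbu`), shifted into position (`slli`), and merged (`or`). The function below is a hand-written illustration of that pattern, not compiler output and not code from capnproto-rust; the name `bytewise_load` is invented for this sketch.

```rust
/// Assembles a little-endian u64 one byte at a time, mirroring the
/// load/shift/or pattern in the riscv32i listing above.
pub fn bytewise_load(x: &[u8; 8]) -> u64 {
    let mut value = 0u64;
    for (i, &byte) in x.iter().enumerate() {
        // Byte i contributes bits 8*i through 8*i + 7 of the result.
        value |= (byte as u64) << (8 * i);
    }
    value
}
```

On a 32-bit target the compiler builds the result as two 32-bit halves, which is why the listing assembles bytes 0 to 3 and bytes 4 to 7 separately.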

Conclusions

As expected, direct_load() and indirect_load() generate the same assembly code for many targets. These are presumably exactly the targets that support unaligned memory access. On targets where different instructions were generated for the two functions, indirect_load() typically requires somewhere between 2x and 3x the number of instructions of direct_load(). Is that an acceptable cost? How much of an impact would it have in the context of a complete real-world program? I don’t know! I’m inclined to believe that the usability benefits of the indirect_load() approach outweigh its performance cost, especially since that cost is probably zero or negligible on the most commonly used targets, but maybe that’s not true? I encourage any readers of this post who have thoughts on the matter to comment on this GitHub issue.
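For anyone who wants to go beyond instruction counts, a controlled experiment could start from something like the rough timing harness below. It is only a sketch under assumptions: `sum_words_indirect` and `time_passes` are hypothetical names, the workload (summing every word in a buffer) is arbitrary, and a real measurement would want a proper benchmarking setup on the hardware of interest.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums every complete 8-byte little-endian word in `buf` using safe,
/// alignment-agnostic loads (the `indirect_load()` style).
pub fn sum_words_indirect(buf: &[u8]) -> u64 {
    let mut total = 0u64;
    for chunk in buf.chunks_exact(8) {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        total = total.wrapping_add(u64::from_le_bytes(word));
    }
    total
}

/// Runs `f` over `buf` `iterations` times and returns the elapsed wall time.
/// `black_box` discourages the optimizer from removing the work.
pub fn time_passes<F: Fn(&[u8]) -> u64>(f: F, buf: &[u8], iterations: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(f(black_box(buf)));
    }
    start.elapsed()
}
```

Passing a second summing function that uses a different load strategy to `time_passes` would give a first, crude comparison on a given machine.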

