I’ve been porting the AndroidX collection library to Kotlin multiplatform to experiment with binary compatibility, performance, tooling, and the different memory models. Some of the data structures in the library use array-based binary trees to store elements. The Java code has a lot of shifts to replace power-of-two multiplications and divides. When ported to Kotlin, these turn into the slightly-awkward infix operators which further obfuscate the intent of the code.
I sampled a few people about bitwise shifts vs. multiplication/division and many had heard anecdotal claims of shifts having better performance, but everyone remained skeptical of whether it was true. Some assumed that one of the compilers seen before the code ran on a CPU would handle optimizing this case.
In an effort to satisfy my curiosity (and partially to avoid Kotlin’s infix bitwise operators) I set out to answer which is better and some other related questions. Let’s go!
Does anyone optimize this?
There are three major compilers that code passes through before it hits the CPU: javac/kotlinc, D8/R8, and ART.
Each of these has the opportunity to optimize. But do they?
javac
    class Example {
      static int multiply(int value) {
        return value * 2;
      }
      static int divide(int value) {
        return value / 2;
      }
      static int shiftLeft(int value) {
        return value << 1;
      }
      static int shiftRight(int value) {
        return value >> 1;
      }
    }
This Java can be compiled with javac from JDK 14 and the resulting bytecode can be displayed with javap.
    $ javac Example.java
    $ javap -c Example
    Compiled from "Example.java"
    class Example {
      static int multiply(int);
        Code:
           0: iload_0
           1: iconst_2
           2: imul
           3: ireturn

      static int divide(int);
        Code:
           0: iload_0
           1: iconst_2
           2: idiv
           3: ireturn

      static int shiftLeft(int);
        Code:
           0: iload_0
           1: iconst_1
           2: ishl
           3: ireturn

      static int shiftRight(int);
        Code:
           0: iload_0
           1: iconst_1
           2: ishr
           3: ireturn
    }
Every method starts with iload_0, which loads the first argument value. The multiply and divide methods both then have iconst_2, which loads the constant value 2. Each then runs imul or idiv to perform integer multiplication or integer division, respectively. The shift methods load the constant value 1 before ishl or ishr, which is an integer shift left or integer shift right, respectively.
No optimization here, but if you know anything about Java this isn’t unexpected. javac isn’t an optimizing compiler and leaves the majority of the work to its runtime compilers on the JVM or ahead-of-time compilers.
kotlinc
    fun multiply(value: Int) = value * 2
    fun divide(value: Int) = value / 2
    fun shiftLeft(value: Int) = value shl 1
    fun shiftRight(value: Int) = value shr 1
The Kotlin is compiled to Java bytecode with kotlinc from Kotlin 1.4-M1, and the javap tool can once again be used to display it.
    $ kotlinc Example.kt
    $ javap -c ExampleKt
    Compiled from "Example.kt"
    public final class ExampleKt {
      public static final int multiply(int);
        Code:
           0: iload_0
           1: iconst_2
           2: imul
           3: ireturn

      public static final int divide(int);
        Code:
           0: iload_0
           1: iconst_2
           2: idiv
           3: ireturn

      public static final int shiftLeft(int);
        Code:
           0: iload_0
           1: iconst_1
           2: ishl
           3: ireturn

      public static final int shiftRight(int);
        Code:
           0: iload_0
           1: iconst_1
           2: ishr
           3: ireturn
    }
Exactly the same output as Java. This is using the original JVM backend of Kotlin, but using the forthcoming IR-based backend (via -Xuse-ir) also produces the same output.
D8
We’ll use the Java bytecode output from the Kotlin example as input to the latest D8 built from master (SHA 2a2bf622d at the time of writing).
    $ java -jar $R8_HOME/build/libs/d8.jar \
          --release \
          --output . \
          ExampleKt.class
    $ dexdump -d classes.dex
    Opened 'classes.dex', DEX version '035'
    Class #0 -
      Class descriptor : 'LExampleKt;'
      Access flags : 0x0011 (PUBLIC FINAL)
      Superclass : 'Ljava/lang/Object;'
      Direct methods -
        #0 : (in LExampleKt;)
          name : 'divide'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000118:                |[000118] ExampleKt.divide:(I)I
    000128: db00 0102      |0000: div-int/lit8 v0, v1, #int 2 // #02
    00012c: 0f00           |0002: return v0

        #1 : (in LExampleKt;)
          name : 'multiply'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000130:                |[000130] ExampleKt.multiply:(I)I
    000140: da00 0102      |0000: mul-int/lit8 v0, v1, #int 2 // #02
    000144: 0f00           |0002: return v0

        #2 : (in LExampleKt;)
          name : 'shiftLeft'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000148:                |[000148] ExampleKt.shiftLeft:(I)I
    000158: e000 0101      |0000: shl-int/lit8 v0, v1, #int 1 // #01
    00015c: 0f00           |0002: return v0

        #3 : (in LExampleKt;)
          name : 'shiftRight'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000160:                |[000160] ExampleKt.shiftRight:(I)I
    000170: e100 0101      |0000: shr-int/lit8 v0, v1, #int 1 // #01
    000174: 0f00           |0002: return v0
(Note: output slightly trimmed)
Dalvik bytecode is register-based instead of stack-based like Java bytecode. As a result, each method only has one real bytecode which does the associated integer operation. Each uses the v1 register which will be the first argument value and an integer literal of 2 or 1.
So no change in behavior, but D8 isn’t an optimizing compiler (although it can do method-local optimization).
R8
To run R8 we need to define a rule in order to keep our methods from being removed.
    -keep,allowoptimization class ExampleKt {
      <methods>;
    }
The rules are passed with --pg-conf and we also supply the Android APIs to link against using --lib.
    $ java -jar $R8_HOME/build/libs/r8.jar \
          --lib $ANDROID_HOME/platforms/android-29/android.jar \
          --release \
          --pg-conf rules.txt \
          --output . \
          ExampleKt.class
    $ dexdump -d classes.dex
    Opened 'classes.dex', DEX version '035'
    Class #0 -
      Class descriptor : 'LExampleKt;'
      Access flags : 0x0011 (PUBLIC FINAL)
      Superclass : 'Ljava/lang/Object;'
      Direct methods -
        #0 : (in LExampleKt;)
          name : 'divide'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000118:                |[000118] ExampleKt.divide:(I)I
    000128: db00 0102      |0000: div-int/lit8 v0, v1, #int 2 // #02
    00012c: 0f00           |0002: return v0

        #1 : (in LExampleKt;)
          name : 'multiply'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000130:                |[000130] ExampleKt.multiply:(I)I
    000140: da00 0102      |0000: mul-int/lit8 v0, v1, #int 2 // #02
    000144: 0f00           |0002: return v0

        #2 : (in LExampleKt;)
          name : 'shiftLeft'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000148:                |[000148] ExampleKt.shiftLeft:(I)I
    000158: e000 0101      |0000: shl-int/lit8 v0, v1, #int 1 // #01
    00015c: 0f00           |0002: return v0

        #3 : (in LExampleKt;)
          name : 'shiftRight'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000160:                |[000160] ExampleKt.shiftRight:(I)I
    000170: e100 0101      |0000: shr-int/lit8 v0, v1, #int 1 // #01
    000174: 0f00           |0002: return v0
Same exact output as D8.
ART
We’ll use the Dalvik bytecode output from the R8 example as the input to ART running on Android 10 on an x86 emulator.
    $ adb push classes.dex /sdcard/classes.dex
    $ adb shell
    generic_x86:/ $ su
    generic_x86:/ # dex2oat --dex-file=/sdcard/classes.dex --oat-file=/sdcard/classes.oat
    generic_x86:/ # oatdump --oat-file=/sdcard/classes.oat
    OatDexFile:
    0: LExampleKt; (offset=0x000003c0) (type_idx=1) (Initialized) (OatClassAllCompiled)
      0: int ExampleKt.divide(int) (dex_method_idx=0)
        CODE: (code_offset=0x00001010 size_offset=0x0000100c size=15)...
          0x00001010: 89C8      mov eax, ecx
          0x00001012: 8D5001    lea edx, [eax + 1]
          0x00001015: 85C0      test eax, eax
          0x00001017: 0F4DD0    cmovnl/ge edx, eax
          0x0000101a: D1FA      sar edx
          0x0000101c: 89D0      mov eax, edx
          0x0000101e: C3        ret
      1: int ExampleKt.multiply(int) (dex_method_idx=1)
        CODE: (code_offset=0x00001030 size_offset=0x0000102c size=5)...
          0x00001030: D1E1      shl ecx
          0x00001032: 89C8      mov eax, ecx
          0x00001034: C3        ret
      2: int ExampleKt.shiftLeft(int) (dex_method_idx=2)
        CODE: (code_offset=0x00001030 size_offset=0x0000102c size=5)...
          0x00001030: D1E1      shl ecx
          0x00001032: 89C8      mov eax, ecx
          0x00001034: C3        ret
      3: int ExampleKt.shiftRight(int) (dex_method_idx=3)
        CODE: (code_offset=0x00001040 size_offset=0x0000103c size=5)...
          0x00001040: D1F9      sar ecx
          0x00001042: 89C8      mov eax, ecx
          0x00001044: C3        ret
(Note: output significantly trimmed)
The x86 assembly reveals that ART has indeed stepped in and normalized the arithmetic operations to use shifts!
First, multiply and shiftLeft now have the exact same implementation. They both use shl for a left bitwise shift of 1. Beyond this, if you look at the offsets in the file (the leftmost column), they are actually the same. ART has recognized these functions have the same body when compiled into x86 assembly and has de-duplicated them.
Next, while divide and shiftRight are not the same, they do share the use of sar for a right bitwise shift of 1. The four additional instructions in divide that precede sar handle the case when the input is negative by adding 1 to the value.
Running the same commands on a Pixel 4 running Android 10 shows how ART compiles this code to ARM assembly.
    OatDexFile:
    0: LExampleKt; (offset=0x000005a4) (type_idx=1) (Verified) (OatClassAllCompiled)
      0: int ExampleKt.divide(int) (dex_method_idx=0)
        CODE: (code_offset=0x00001009 size_offset=0x00001004 size=10)...
          0x00001008: 0fc8    lsrs r0, r1, #31
          0x0000100a: 1841    adds r1, r0, r1
          0x0000100c: 1049    asrs r1, #1
          0x0000100e: 4608    mov r0, r1
          0x00001010: 4770    bx lr
      1: int ExampleKt.multiply(int) (dex_method_idx=1)
        CODE: (code_offset=0x00001021 size_offset=0x0000101c size=4)...
          0x00001020: 0048    lsls r0, r1, #1
          0x00001022: 4770    bx lr
      2: int ExampleKt.shiftLeft(int) (dex_method_idx=2)
        CODE: (code_offset=0x00001021 size_offset=0x0000101c size=4)...
          0x00001020: 0048    lsls r0, r1, #1
          0x00001022: 4770    bx lr
      3: int ExampleKt.shiftRight(int) (dex_method_idx=3)
        CODE: (code_offset=0x00001031 size_offset=0x0000102c size=4)...
          0x00001030: 1048    asrs r0, r1, #1
          0x00001032: 4770    bx lr
Once again multiply and shiftLeft both use lsls for a left shift and were de-duplicated, and shiftRight uses asrs for a right shift. divide is also using asrs for its right shift, but it uses another right shift, lsrs, to handle adding 1 for negative values.
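To make the sign fix-up concrete, here is a rough Java rendering of the two instruction sequences shown above. This is only a sketch of what the assembly computes, not ART's actual implementation, and the class and method names (SignedHalving, halveWithSelect, halveWithSignBit) are made up for illustration.

```java
// Sketch of the fix-up that runs before the arithmetic right shift.
// Names are hypothetical; the logic mirrors the dumps above.
final class SignedHalving {
    // x86 shape: lea/test/cmov pick (value + 1) for negatives, then sar.
    static int halveWithSelect(int value) {
        int biased = value >= 0 ? value : value + 1;
        return biased >> 1;
    }

    // ARM shape: lsrs by 31 extracts the sign bit (1 for negatives, 0 otherwise),
    // adds puts it on top of the value, then asrs performs the shift.
    static int halveWithSignBit(int value) {
        return (value + (value >>> 31)) >> 1;
    }

    public static void main(String[] args) {
        int[] samples = {7, 4, 1, 0, -1, -4, -7, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int value : samples) {
            // All three columns should agree for every sample.
            System.out.println(value + ": " + (value / 2)
                    + " " + halveWithSelect(value)
                    + " " + halveWithSignBit(value));
        }
    }
}
```

Without the added 1, an odd negative input would round toward negative infinity instead of toward zero, which is not what integer division does.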
With this we can now definitively say that replacing value * 2 with value << 1 offers no benefit. Stop doing it for arithmetic operations and reserve it only for strictly bitwise things!
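The rewrite is legal for a compiler because the two expressions agree for every int, including the ones that overflow. A small sketch to convince yourself (the class name DoubleOrShift is just for this example):

```java
// Checks that value * 2 and value << 1 agree, even when the result wraps around.
final class DoubleOrShift {
    static boolean sameResult(int value) {
        return value * 2 == value << 1;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 21, -21, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int value : samples) {
            System.out.println(value + " * 2 = " + (value * 2)
                    + ", " + value + " << 1 = " + (value << 1)
                    + ", same = " + sameResult(value));
        }
    }
}
```

Both operations discard the same high bit on overflow, so there is no input where swapping one for the other changes behavior.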
However, value / 2 and value >> 1 still produce different assembly instructions and thus presumably have different performance characteristics. Thankfully, doing value / 2 avoids using generic division and is still primarily based on right shift, so they’re likely not that far apart in terms of performance.
Is shift faster than division?
To determine whether a divide or shift is faster we can use the Jetpack benchmark library.
    class DivideOrShiftTest {
      @JvmField @Rule val benchmark = BenchmarkRule()

      @Test fun divide() {
        val value = "4".toInt() // Ensure not a constant.
        var result = 0
        benchmark.measureRepeated {
          result = value / 2
        }
        println(result) // Ensure D8 keeps computation.
      }

      @Test fun shift() {
        val value = "4".toInt() // Ensure not a constant.
        var result = 0
        benchmark.measureRepeated {
          result = value shr 1
        }
        println(result) // Ensure D8 keeps computation.
      }
    }
I don’t have any x86 devices but I do have an ARM-based Pixel 3 running Android 10. Here are the results:
    android.studio.display.benchmark=4 ns DivideOrShiftTest.divide
    count=4006
    mean=4
    median=4
    min=4
    standardDeviation=0

    android.studio.display.benchmark=3 ns DivideOrShiftTest.shift
    count=3943
    mean=3
    median=3
    min=3
    standardDeviation=0
There’s effectively zero difference between using division versus a shift with numbers this small. Those are nanoseconds, after all. Using a negative number shows no difference in the result.
With this we can now definitively say that replacing value / 2 with value >> 1 offers no benefit. Stop doing it for arithmetic operations and reserve it only for strictly bitwise things!
Can D8/R8 use this information to save APK size?
Given two different ways to express the same operations we should choose the one that has the better performance. But if both have the same performance, we should choose whichever results in a smaller APK size.
We know that value * 2 and value << 1 produce the same assembly from ART. Thus, if one is more space-efficient than the other in Dalvik bytecode we should unconditionally rewrite it into the smaller form. Looking at the output from D8 these produce the same size bytecode:
        #1 : (in LExampleKt;)
          name : 'multiply'
     ⋮
    000140: da00 0102      |0000: mul-int/lit8 v0, v1, #int 2 // #02

        #2 : (in LExampleKt;)
          name : 'shiftLeft'
     ⋮
    000158: e000 0101      |0000: shl-int/lit8 v0, v1, #int 1 // #01
While there are no gains to be had for this power of 2, the multiplication runs out of bytecode space for storing the literal value before the shift does. Here’s value * 32_768 compared to value << 15:
        #1 : (in LExampleKt;)
          name : 'multiply'
     ⋮
    000128: 1400 0080 0000 |0000: const v0, #float 0.000000 // #00008000
    00012e: 9201 0100      |0003: mul-int v1, v1, v0

        #2 : (in LExampleKt;)
          name : 'shiftLeft'
     ⋮
    00015c: e000 000f      |0000: shl-int/lit8 v0, v0, #int 15 // #0f
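The size difference can be modeled with a little arithmetic on 16-bit code units. The sketch below is a simplification based on the two dumps above plus my reading of the Dalvik instruction formats (the 16-bit literal form, mul-int/lit16, does not appear in this post and is an assumption here). It ignores other constant-loading forms a compiler might pick, and the class name LiteralEncoding is invented for the example.

```java
// Simplified model of how many 16-bit code units the arithmetic takes,
// excluding the return instruction. Not D8's actual selection logic.
final class LiteralEncoding {
    // value * literal
    static int multiplyCodeUnits(int literal) {
        if (literal >= Byte.MIN_VALUE && literal <= Byte.MAX_VALUE) {
            return 2; // mul-int/lit8, as in the value * 2 dump
        }
        if (literal >= Short.MIN_VALUE && literal <= Short.MAX_VALUE) {
            return 2; // mul-int/lit16 (assumed from the Dalvik format tables)
        }
        return 3 + 2; // const (3 units) + mul-int (2 units), as in the 32_768 dump
    }

    // value << distance: an int shift distance is at most 31, so it always fits in lit8.
    static int shiftCodeUnits(int distance) {
        return 2; // shl-int/lit8
    }

    public static void main(String[] args) {
        for (int distance : new int[] {1, 7, 14, 15, 20}) {
            int literal = 1 << distance;
            System.out.println("value * " + literal + ": " + multiplyCodeUnits(literal)
                    + " units, value << " + distance + ": " + shiftCodeUnits(distance) + " units");
        }
    }
}
```

Under this model the two forms tie for every power of two up to 16,384 and only split once the literal no longer fits in 16 signed bits.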
I have filed an issue on D8 to investigate optimizing this automatically, but I strongly suspect the cases where it applies to be near zero so it’s likely not worthwhile.
The output of D8 and R8 also tells us that value / 2 and value >> 1 cost the same in terms of Dalvik bytecode.
        #0 : (in LExampleKt;)
          name : 'divide'
     ⋮
    000128: db00 0102      |0000: div-int/lit8 v0, v1, #int 2 // #02

        #3 : (in LExampleKt;)
          name : 'shiftRight'
     ⋮
    000170: e100 0101      |0000: shr-int/lit8 v0, v1, #int 1 // #01
These will also diverge in bytecode size when the literal reaches 32,768. Unconditionally replacing a power-of-two division with a right shift is never safe because of the behavior around negatives. We could do the replacement if the value was guaranteed to be non-negative, but D8 and R8 do not track the possible ranges of integer values at this time.
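The behavior around negatives is easy to see with a couple of values. This sketch (class name DivideVersusShift is hypothetical) compares the two operations directly:

```java
// Shows where signed division and arithmetic right shift disagree.
final class DivideVersusShift {
    // True when swapping value / 2 for value >> 1 would not change the result.
    static boolean safeToShift(int value) {
        return value / 2 == value >> 1;
    }

    public static void main(String[] args) {
        int[] samples = {5, 4, 0, -4, -5};
        for (int value : samples) {
            // Division rounds toward zero; the shift rounds toward negative infinity.
            System.out.println(value + " / 2 = " + (value / 2)
                    + ", " + value + " >> 1 = " + (value >> 1)
                    + ", safe = " + safeToShift(value));
        }
    }
}
```

Only odd negative inputs disagree, but a bytecode optimizer that cannot prove the input is non-negative has to assume one might show up.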
Does unsigned number power-of-two division use shift?
Java bytecode lacks unsigned numbers, but you can emulate them by using the signed counterparts. In Java there are static helper methods for operating on signed types as unsigned values. Kotlin offers types like UInt which do similar things but completely abstracted behind a type. It’s conceivable, then, that division by a power of two could be rewritten as a shift.
We can use Kotlin to model both of these cases.
    fun javaLike(value: Int) = Integer.divideUnsigned(value, 2)
    fun kotlinLike(value: UInt) = value / 2U
There are a few cases to look at just with how the code is compiled. We’ll start with plain kotlinc (again with Kotlin 1.4-M1).
    $ kotlinc Example.kt
    $ javap -c ExampleKt
    Compiled from "Example.kt"
    public final class ExampleKt {
      public static final int javaLike(int);
        Code:
           0: iload_0
           1: iconst_2
           2: invokestatic  #12       // Method java/lang/Integer.divideUnsigned:(II)I
           5: ireturn

      public static final int kotlinLike-WZ4Q5Ns(int);
        Code:
           0: iload_0
           1: istore_1
           2: iconst_2
           3: istore_2
           4: iconst_0
           5: istore_3
           6: iload_1
           7: iload_2
           8: invokestatic  #20       // Method kotlin/UnsignedKt."uintDivide-J1ME1BU":(II)I
          11: ireturn
    }
Kotlin does not recognize this as a power-of-two division where it could use the iushr bytecode. I’ve filed KT-38493 to track adding this behavior.
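For reference, the rewrite being asked for is the following equivalence: Java's >>> operator compiles to iushr, and for a divisor of 2 it gives the same answer as the unsigned division helper. A minimal sketch, with an invented class name:

```java
// Unsigned division by two versus an unsigned (logical) right shift by one.
final class UnsignedHalving {
    static int viaLibrary(int value) {
        return Integer.divideUnsigned(value, 2);
    }

    static int viaShift(int value) {
        return value >>> 1; // iushr in Java bytecode
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 9, -1, -9, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int value : samples) {
            System.out.println(Integer.toUnsignedString(value) + " / 2 = "
                    + Integer.toUnsignedString(viaLibrary(value)) + " (library), "
                    + Integer.toUnsignedString(viaShift(value)) + " (shift)");
        }
    }
}
```

Unlike the signed case there is no rounding mismatch to patch up, since an unsigned value is never negative.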
Using -Xuse-ir doesn’t change anything (except remove some of the load/store noise). However, targeting Java 8 does.
    $ kotlinc -jvm-target 1.8 Example.kt
    $ javap -c ExampleKt
    Compiled from "Example.kt"
    public final class ExampleKt {
      public static final int javaLike(int);
        Code:
           0: iload_0
           1: iconst_2
           2: invokestatic  #12       // Method java/lang/Integer.divideUnsigned:(II)I
           5: ireturn

      public static final int kotlinLike-WZ4Q5Ns(int);
        Code:
           0: iload_0
           1: iconst_2
           2: invokestatic  #12       // Method java/lang/Integer.divideUnsigned:(II)I
           5: ireturn
    }
The Integer.divideUnsigned method is available as of Java 8 so it’s preferred when targeting 1.8 or newer. Since this makes both function bodies identical, let’s revert to the old output just to see how it fares in comparison.
Next up is R8. Notably different from when it was invoked above is that we include the Kotlin stdlib as an input and we also pass --min-api 24 since Integer.divideUnsigned is only available on API 24 and newer.
    $ java -jar $R8_HOME/build/libs/r8.jar \
          --lib $ANDROID_HOME/platforms/android-29/android.jar \
          --min-api 24 \
          --release \
          --pg-conf rules.txt \
          --output . \
          ExampleKt.class kotlin-stdlib.jar
    $ dexdump -d classes.dex
    Opened 'classes.dex', DEX version '039'
    Class #0 -
      Class descriptor : 'LExampleKt;'
      Access flags : 0x0011 (PUBLIC FINAL)
      Superclass : 'Ljava/lang/Object;'
      Direct methods -
        #0 : (in LExampleKt;)
          name : 'javaLike'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    0000f8:                          |[0000f8] ExampleKt.javaLike:(I)I
    000108: 1220                     |0000: const/4 v0, #int 2 // #2
    00010a: 7120 0200 0100           |0001: invoke-static {v1, v0}, Ljava/lang/Integer;.divideUnsigned:(II)I // method@0002
    000110: 0a01                     |0004: move-result v1
    000112: 0f01                     |0005: return v1

        #1 : (in LExampleKt;)
          name : 'kotlinLike-WZ4Q5Ns'
          type : '(I)I'
          access : 0x0019 (PUBLIC STATIC FINAL)
          code -
    000114:                          |[000114] ExampleKt.kotlinLike-WZ4Q5Ns:(I)I
    000124: 8160                     |0000: int-to-long v0, v6
    000126: 1802 ffff ffff 0000 0000 |0001: const-wide v2, #double 0.000000 // #00000000ffffffff
    000130: c020                     |0006: and-long/2addr v0, v2
    000132: 1226                     |0007: const/4 v6, #int 2 // #2
    000134: 8164                     |0008: int-to-long v4, v6
    000136: c042                     |0009: and-long/2addr v2, v4
    000138: be20                     |000a: div-long/2addr v0, v2
    00013a: 8406                     |000b: long-to-int v6, v0
    00013c: 0f06                     |000c: return v6
Kotlin has its own unsigned integer division implementation which was inlined into our function. It converts the input argument and the literal to longs, performs long division, and then converts back to int. When we eventually run them through ART they’re just translated to equivalent x86 so we’re going to leave this function behind. The opportunity for optimization here was already missed.
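Written out in Java, the inlined sequence amounts to roughly the following. This is my paraphrase of the bytecode above rather than the Kotlin stdlib source, and the class name WideningUnsignedDivide is made up.

```java
// Unsigned 32-bit division done by widening both operands to long.
final class WideningUnsignedDivide {
    static int divide(int dividend, int divisor) {
        long wideDividend = dividend & 0xFFFFFFFFL; // int-to-long + and-long with the mask
        long wideDivisor = divisor & 0xFFFFFFFFL;   // same treatment for the literal
        return (int) (wideDividend / wideDivisor);  // div-long, then long-to-int
    }

    public static void main(String[] args) {
        // -1 is 4294967295 when read as unsigned, so half of it is 2147483647.
        System.out.println(Integer.toUnsignedString(divide(-1, 2)));
        System.out.println(divide(-1, 2) == Integer.divideUnsigned(-1, 2));
    }
}
```

Masking with 0xFFFFFFFF clears the sign extension, so both longs are non-negative and ordinary signed long division gives the unsigned answer.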
For the Java version, R8 failed to replace the divideUnsigned call with a shift. I’ve filed issue 154712996 to track this for D8 and R8.
The last opportunity to optimize this case is ART.
    $ adb push classes.dex /sdcard/classes.dex
    $ adb shell
    generic_x86:/ $ su
    generic_x86:/ # dex2oat --dex-file=/sdcard/classes.dex --oat-file=/sdcard/classes.oat
    generic_x86:/ # oatdump --oat-file=/sdcard/classes.oat
    OatDexFile:
    0: LExampleKt; (offset=0x000003c0) (type_idx=1) (Initialized) (OatClassAllCompiled)
      0: int ExampleKt.javaLike(int) (dex_method_idx=0)
        CODE: (code_offset=0x00001010 size_offset=0x0000100c size=63)...
          0x00001010: 85842400E0FFFF      test eax, [esp + -8192]
            StackMap[0] (native_pc=0x1017, dex_pc=0x0, register_mask=0x0, stack_mask=0b)
          0x00001017: 55                  push ebp
          0x00001018: 83EC18              sub esp, 24
          0x0000101b: 890424              mov [esp], eax
          0x0000101e: 6466833D0000000000  cmpw fs:[0x0], 0 ; state_and_flags
          0x00001027: 0F8519000000        jnz/ne +25 (0x00001046)
          0x0000102d: E800000000          call +0 (0x00001032)
          0x00001032: 5D                  pop ebp
          0x00001033: BA02000000          mov edx, 2
          0x00001038: 8B85CE0F0000        mov eax, [ebp + 4046]
          0x0000103e: FF5018              call [eax + 24]
            StackMap[1] (native_pc=0x1041, dex_pc=0x1, register_mask=0x0, stack_mask=0b)
          0x00001041: 83C418              add esp, 24
          0x00001044: 5D                  pop ebp
          0x00001045: C3                  ret
          0x00001046: 64FF15E0020000      call fs:[0x2e0] ; pTestSuspend
            StackMap[2] (native_pc=0x104d, dex_pc=0x0, register_mask=0x0, stack_mask=0b)
          0x0000104d: EBDE                jmp -34 (0x0000102d)
      1: int ExampleKt.kotlinLike-WZ4Q5Ns(int) (dex_method_idx=1)
        CODE: (code_offset=0x00001060 size_offset=0x0000105c size=67)...
          ⋮
ART does not intrinsify calls to divideUnsigned, so instead we get the machinery to jump to the regular method implementation. I filed issue 154693569 to track adding the ART intrinsics for unsigned divide.
Well that certainly was a journey. Congrats if you made it this far (or just scrolled to the bottom). Let’s summarize:
- ART rewrites power-of-two multiplication to left shift and power-of-two division to right shift (with a few extra instructions to handle negatives).
- There is no observable performance difference between a right shift and power-of-two division.
- There is no size difference in Dalvik bytecode between shifts and multiply/divide until the literal reaches 32,768.
- Nobody optimizes unsigned division (yet), but you’re probably not using it anyway.
With these facts we can answer the title of this post:
Which is better on Android: divide by 2 or shift by 1?
Neither! So use division for arithmetic and only use shifts for actual bitwise operations. I’ll be switching the AndroidX collection port from shifts to multiply and divide. See you next time.