Locating a compiler bug with git bisection

栏目: IT技术 · 发布时间: 4年前

内容简介:Tags:programming,This is just a brief post on the steps I took to confirm, locate, and backport the fix for a compiler bug that was plaguing a (professional) project that I work on. It’s a straightforward application ofI work on LLVM-based program analysis

May 7, 2020

Tags:programming, devblog

This is just a brief post on the steps I took to confirm, locate, and backport the fix for a compiler bug that was plaguing a (professional) project that I work on. It’s a straightforward application of git bisect , with a few observations:

  1. That even compilers have bugs (and that, depending on the context, the compiler might be the most buggy component)
  2. That git bisect is incredibly flexible and can work on extremely complex programs
  3. That backporting a fix isn’t always difficult or fraught

Background: the bug

I work on LLVM-based program analysis tools professionally, and currently work on a project that makes heavy use of a modified build of LLVM with some additional features.

Many of the components that I work on assume that LLVM faithfully emits debug intrinsics (i.e. call void llvm.dbg.whatever ) and location expressions (i.e., DILocation , DebugLoc ) for every source-level variable.

This is normally a reasonable assumption, at least for unoptimized or less-optimized compilations, and allows us to make further assumptions about the information we find in DWARF after lowering.

Here’s a source file (minimized from the original) that, when compiled with -O1 , violated the above assumption:

#include <stdio.h>
#include <stdlib.h>

struct foo {
  char buffer[32];
};

int bar(int index) {
  struct foo stuff = {};
  if (index >= sizeof(stuff.buffer)) {
      return -1;
  }
  return (int)stuff.buffer[index];
}

int main(int argc, char **argv) {
  return bar(atoi(argv[1]));
}

…and here’s the corresponding LLVM IR for bar :

; Function Attrs: nounwind readnone uwtable
define dso_local i32 @bar(i32) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !17, metadata !DIExpression()), !dbg !25
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !26
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #6, !dbg !26
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %3, i8 0, i64 32, i1 false), !dbg !27
  %4 = icmp ugt i32 %0, 31, !dbg !28
  br i1 %4, label %10, label %5, !dbg !30

; <label>:5:                                      ; preds = %1
  %6 = sext i32 %0 to i64, !dbg !31
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !32
  %8 = load i8, i8* %7, align 1, !dbg !32, !tbaa !33
  %9 = sext i8 %8 to i32, !dbg !36
  br label %10, !dbg !37

; <label>:10:                                     ; preds = %1, %5
  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !38
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #6, !dbg !39
  ret i32 %11, !dbg !39
}

Some things immediately stand out:

  • Despite there being two variables in scope (the index argument and stuff ), only one llvm.dbg.* intrinsic is present. Inspecting the debug nodes reveals that it’s for index , not stuff :

    !17 = !DILocalVariable(name: "index", arg: 1, scope: !13, file: !1, line: 8, type: !4)

    (This makes sense, as index is likely to be stored in a register on our target platform and thus has an llvm.dbg.value intrinsic rather than a memory-based one.)

  • We have lifetime intrinsics ( llvm.lifetime.start and llvm.lifetime.end ) for the structure as well as the expected alloca and getelementptr for creating and indexing into it. Some of these (the getelementptr and llvm.lifetime.start ) also have attached metadata that references our expected location for stuff :

    !26 = !DILocation(line: 9, column: 3, scope: !13)
  • The DILocalVariable for stuff is embedded in the IR, but is completely orphaned by the function other than a reference in retainedNodes :

    !13 = distinct !DISubprogram(name: "bar", scope: !1, file: !1, line: 8, type: !14, isLocal: false, isDefinition: true, scopeLine: 8, flags: DIFlagPrototyped,isOptimized: true, unit: !0, retainedNodes: !16)
      !16 = !{!17, !18}
      !18 = !DILocalVariable(name: "stuff", scope: !13, file: !1, line: 9, type: !19)

    The DILocalVariable for index is also in retainedNodes , curiously, despite not being orphaned and therefore not requiring separate retention.

All told, this is extremely suspicious: we’re missing an intrinsic, a debug node is incorrectly marked as “retained”, and there’s no connection between our (orphaned) DILocalVariable and its DILocation .

Unsurprisingly, this variable/location disconnect results in broken DWARF information for stuff , specifically in the form of a missing DW_AT_location (which we would expect, based on the alloca , to reflect a stack location):

0x00000043:   DW_TAG_subprogram
                DW_AT_low_pc    (0x00000000004004d0)
                DW_AT_high_pc   (0x00000000004004f0)
                DW_AT_frame_base        (DW_OP_reg7 RSP)
                DW_AT_name      ("bar")
                DW_AT_decl_file ("/tmp/junk.c")
                DW_AT_decl_line (8)
                DW_AT_prototyped        (true)
                DW_AT_type      (0x0000002a "int")
                DW_AT_external  (true)

0x0000005c:     DW_TAG_formal_parameter
                  DW_AT_location        (DW_OP_reg5 RDI)
                  DW_AT_name    ("index")
                  DW_AT_decl_file       ("/tmp/junk.c")
                  DW_AT_decl_line       (8)
                  DW_AT_type    (0x0000002a "int")

0x00000069:     DW_TAG_variable
                  DW_AT_name    ("stuff")
                  DW_AT_decl_file       ("/tmp/junk.c")
                  DW_AT_decl_line       (9)
                  DW_AT_type    (0x000000ad "foo")

First steps: reproducing

Since we’re dealing with a modified LLVM 7 here, the very first step is to make sure that we’re not the cause of our own problem.

A quick trip to Godbolt settles that, showing the same missing intrinsic on a reference build of LLVM/Clang 7:

define dso_local i32 @bar(i32) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !18, metadata !DIExpression()), !dbg !26
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !27
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #6, !dbg !27
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %3, i8 0, i64 32, i1 false), !dbg !28
  %4 = icmp ugt i32 %0, 31, !dbg !29
  br i1 %4, label %10, label %5, !dbg !31

  %6 = sext i32 %0 to i64, !dbg !32
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !33
  %8 = load i8, i8* %7, align 1, !dbg !33, !tbaa !34
  %9 = sext i8 %8 to i32, !dbg !37
  br label %10, !dbg !38

  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !39
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #6, !dbg !40
  ret i32 %11, !dbg !40
}

( Permalink )

Finding the fix

At this point, the panic settles in: there’s a bug in the compiler that isn’t (directly) our fault, and we need to fix it in order to continue.

Compilers are complicated beasts, and we’d rather not find and fix bugs ourselves if we don’t have to. So, we return to Godbolt and see whether someone else has done the hard work for us sometime between LLVM 7 (released 09/2019) and the latest trunk build.

Indeed! Compiling with a trunk build ( permalink ) shows two debug intrinsics, as expected, with correctly attached location metadata on each call:

define dso_local i32 @bar(i32 %0) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !18, metadata !DIExpression()), !dbg !26
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !27
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #7, !dbg !27
  call void @llvm.dbg.declare(metadata %struct.foo* %2, metadata !19, metadata !DIExpression()), !dbg !28
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 dereferenceable(32) %3, i8 0, i64 32, i1 false), !dbg !28
  %4 = icmp ugt i32 %0, 31, !dbg !29
  br i1 %4, label %10, label %5, !dbg !31

5:                                                ; preds = %1
  %6 = sext i32 %0 to i64, !dbg !32
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !33
  %8 = load i8, i8* %7, align 1, !dbg !33, !tbaa !34
  %9 = sext i8 %8 to i32, !dbg !37
  br label %10, !dbg !38

10:                                               ; preds = %1, %5
  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !26
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #7, !dbg !39
  ret i32 %11, !dbg !39
}

So let’s walk backwards.

LLVM 10 also emits two intrinsics, but LLVM 9 is identical to our buggy LLVM 7 IR:

define dso_local i32 @bar(i32) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !18, metadata !DIExpression()), !dbg !26
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !27
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #6, !dbg !27
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %3, i8 0, i64 32, i1 false), !dbg !28
  %4 = icmp ugt i32 %0, 31, !dbg !29
  br i1 %4, label %10, label %5, !dbg !31

5:                                                ; preds = %1
  %6 = sext i32 %0 to i64, !dbg !32
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !33
  %8 = load i8, i8* %7, align 1, !dbg !33, !tbaa !34
  %9 = sext i8 %8 to i32, !dbg !37
  br label %10, !dbg !38

10:                                               ; preds = %1, %5
  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !26
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #6, !dbg !39
  ret i32 %11, !dbg !39
}

So we know that someone fixed this bug, somewhere between the LLVM 9 (09/2019) and LLVM 10 (03/2020) releases. That’s reassuring, but doesn’t get us any closer to actually patching our build. More work is needed.

Finding the fix, for real

Knowing that our desired bugfix is somewhere in the changeset between LLVM 9 and LLVM 10 leaves us with a few naive strategies:

  • We could grep through the git log between the releases, trying to key on something like DWARF , DebugInfo , bug , or similar.
  • We could try every commit between the two releases, stopping when the behavior changes.

Neither of these is particularly appealing: the unstructured nature of commit messages makes it difficult to ensure that we aren’t over- or under-constraining our search, and individually compiling all 10,000+ commits between the two releases would take us weeks.

git bisect to the rescue

git bisect , unsurprisingly, is the “right” answer to exactly this problem. From the manpage :

This command uses a binary search algorithm to find which commit in your project’s history introduced a bug. You use it by first telling it a “bad” commit that is known to contain the bug, and a “good” commit that is known to be before the bug was introduced. Then git bisect picks a commit between those two endpoints and asks you whether the selected commit is “good” or “bad”. It continues narrowing down the range until it finds the exact commit that introduced the change.

We’re going forwards in history and looking for a bugfix rather than a regression, so the good and bad identifiers can be a little confusing. To avoid confusing ourselves too much, we use git bisect ’s alternative old and new terms instead.

We also, of course, need to actually run something on each bisected revision to determine whether it’s old or new . The LLVM project build is, blessedly, straightforward:

$ mkdir build && cd build
$ cmake -G Ninja -DLLVM_TARGETS_TO_BUILD="X86" -DLLVM_ENABLE_PROJECTS="clang" ../llvm
$ cmake --build .
$ ./bin/clang --version
$ ./bin/clang -O1 -g /tmp/testcase.c -o /tmp/testcase
# the -n flag filters DWARF info on DW_AT_name, which in our case is the "stuff" variable
$ ./bin/llvm-dwarfdump -n stuff /tmp/testcase | grep 'DW_AT_location'

(We use Ninja and restrict our backends to x86 for faster compilations; there are other definitions we could add to prune the build but these were sufficient for my purposes.)

Finally, we actually perform the bisection. From the repo root, we begin with:

$ git checkout llvmorg-9.0.0
$ git bisect start
# we know that 9.0 has the bug and 10.0 doesn't, so we can tag them immediately
$ git bisect old
$ git bisect new llvmorg-10.0.0

then, perform our actual test steps on the current revision:

$ cd build
# leverage the CMake cache to speed things up, if it's still valid
$ cmake --build .
$ ./bin/clang -O1 -g /tmp/testcase.c -o /tmp/testcase
$ ./bin/llvm-dwarfdump -n stuff /tmp/testcase | grep 'DW_AT_location'
$ old_or_new=$([[ $? -eq 0 ]] && echo new || echo old)
$ cd ..
$ git bisect ${old_or_new}

…and repeat until no commits are left:

cbf1f3b771c8 : [Debuginfo][SROA] Need to handle dbg.value in SROA pass.

diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index a1d1b3ec84f0..7242eb1d3de2 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -1378,7 +1378,12 @@ void llvm::ConvertDebugDeclareToDebugValue(DbgVariableIntrinsic *DII,
 /// Determine whether this alloca is either a VLA or an array.
 static bool isArray(AllocaInst *AI) {
   return AI->isArrayAllocation() ||
-    AI->getType()->getElementType()->isArrayTy();
+         (AI->getAllocatedType() && AI->getAllocatedType()->isArrayTy());
+}
+
+/// Determine whether this alloca is a structure.
+static bool isStructure(AllocaInst *AI) {
+  return AI->getAllocatedType() && AI->getAllocatedType()->isStructTy();
 }

 /// LowerDbgDeclare - Lowers llvm.dbg.declare intrinsics into appropriate set
@@ -1403,7 +1408,7 @@ bool llvm::LowerDbgDeclare(Function &F) {
     // stored on the stack, while the dbg.declare can only describe
     // the stack slot (and at a lexical-scope granularity). Later
     // passes will attempt to elide the stack slot.
-    if (!AI || isArray(AI))
+    if (!AI || isArray(AI) || isStructure(AI))
       continue;

     // A volatile load/store means that the alloca can't be elided anyway

To explain: the LLVM optimizer includes a pass named SRoA, short for “Scalar Replacement of Aggregates”.

SRoA is run by default on -O1 and higher, and attempts to replace aggregate values (like arrays and structures) with scalar SSA values.

An effective SRoA run requires that the llvm.dbg.declare for a value be lowered to one or more llvm.dbg.value s, as the SSA value may or may not reside at a memory location (it might be in a register, like index ).

The bug occurs because of this lowering behavior: the relevant helper ( LowerDbgDeclare ) fails to check for one of the valid aggregate types (structures) and incorrectly elides the intrinsic without inserting any replacements. The fix is fundamentally a single line; the changes to isArray and addition of isStructure only serve to make it more readable.

Five minutes later, we have a patched LLVM 7 with the correct behavior:

$ ./bin/clang --version
clang version 7.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /store/william/llvm-project/build/./bin

$ ./bin/clang -O1 -g /tmp/testcase.c -o /tmp/testcase
$ ./bin/llvm-dwarfdump -n stuff /tmp/testcase | grep 'DW_AT_location'
/tmp/testcase:  file format elf64-x86-64

0x00000069: DW_TAG_variable
              DW_AT_location    (DW_OP_fbreg -120)
              DW_AT_name        ("stuff")
              DW_AT_decl_file   ("/tmp/testcase.c")
              DW_AT_decl_line   (9)
              DW_AT_type        (0x00000078 "foo")

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

持续交付

持续交付

Jez Humble、David Farley / 乔梁 / 人民邮电出版社 / 2011-10 / 89.00元

Jez Humble编著的《持续交付(发布可靠软件的系统方法)》讲述如何实现更快、更可靠、低成本的自动化软件交付,描述了如何通过增加反馈,并改进开发人员、测试人员、运维人员和项目经理之间的协作来达到这个目标。《持续交付(发布可靠软件的系统方法)》由三部分组成。第一部分阐述了持续交付背后的一些原则,以及支持这些原则的实践。第二部分是本书的核心,全面讲述了部署流水线。第三部分围绕部署流水线的投入产出讨......一起来看看 《持续交付》 这本书的介绍吧!

图片转BASE64编码
图片转BASE64编码

在线图片转Base64编码工具

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换

RGB HSV 转换
RGB HSV 转换

RGB HSV 互转工具