Dumbindent: When 93% of the Time was Spent in Clang-Format

栏目: IT技术 · 发布时间: 4年前

内容简介：Summary: The Wuffs compiler outputs C code. When compiling its standard library, over 93% of the time (2.680 out of 2.855 seconds) was spent formatting that C code withAnother difference (compared to Rust’s primary implementation) is that the Wuffs compile

Summary: The Wuffs compiler outputs C code. When compiling its standard library, over 93% of the time (2.680 out of 2.855 seconds) was spent formatting that C code with clang-format . dumbindent is a new command-line tool (and Go package) that formats C code. Its output is not as ‘pretty’, but it can be over 80 times faster than clang-format (0.008 versus 0.668 seconds to format 12k lines of C code).

Generating C Code

Wuffs is a memory-safe programming language and a standard library written in that language. “Why don’t you use X instead?” is a frequently asked question, and Rust is a frequent X . One difference from Rust is that memory safety (e.g. array index bounds checks) is enforced entirely at compile time , not at run time.

Another difference (compared to Rust’s primary implementation) is that the Wuffs compiler is a transpiler , like the very first C++ compiler . It outputs C code, not object code. That’s not without its costs, but one benefit is that the generated C code’s performance then be compared across different C compilers. For cases where Clang/LLVM is 1.3x slower than gcc , we might never have known how well Wuffs (the language) could perform if its implementation went straight to object code via LLVM.

Compilation Times

Another frequent X is something based on theorem provers or SMT solvers, such as Z3 . Table 1 from one Z3 example reports verification times measured in minutes .

In contrast, Wuffs aims for sub-second compilation times, including compile-time proofs of memory safety. However, over the lifetime of the project, compiling its standard library (currently about 8k lines of Wuffs code that becomes 25k lines of C code) had crept up to over a couple of seconds on a mid-range laptop. This is far from the worst experience that programmers face (cue the obligatory “my code’s compiling” XKCD comic), but it’s still a noticable pause in the edit-compile-run cycle.

The final step of generating the C code was formatting it. The title and summary of this blog post has already spoiled the punchline. After making that final step optional, it turned out that over 93% of the time (2.680 out of 2.855 seconds) was not spent parsing, type checking, bounds checking, analyzing data flow, generating code or any of the other things you’d normally learn about in a compiler textbook. It was spent formatting that C code with clang-format .

One reason that it took so long was that the generated code was formatted twice. Each package in Wuffs’ standard library (e.g. GIF, JSON and ZLIB) is compiled and formatted separately. The per-package output is then amalgamated ( similar to SQLite ) into a single file C library , and formatted again. That second formatting was mostly redundant and was easily made entirely redundant. Removing that second formatting almost halved the time spent generating the Wuffs standard library’s C form.

A similar optimization , avoiding re-formatting the already-formatted, hand-written Wuffs-C interop code, knocked off another chunk of time. Nonetheless, those two changes only brought 93% down to 81%.

Formatting Is Hard

clang-format solves a hard problem: given an arbitrary C program as input (perhaps an invalid one, containing syntax errors), produce something that looks consistent and ‘pretty’, without altering the semantics of that program.

Because C/C++ became popular long before auto-formatters became popular (and e.g. part of pre-submit hooks), there are many different (and incompatible) C/C++ styles in use. clang-format is not a bad program. It’s a big program (for something that ‘merely’ takes text input and prints text output). Its source code weighs over 20k lines of C++ code, excluding its unit tests and dependencies. It has over 100 configuration options (which is still far less than uncrustify ’s 735 configuration options ):

$ clang-format-9 -dump-config | wc -l
127

Code formatting not just a hard problem, it’s really hard. Bob Nystrom’s a smart programmer. Crafting Interpreters is a fantastic book. His “What Color is Your Function?” article is often cited, and asynchronous programming is famously difficult. But he called dartfmt , a source code formatter, “The Hardest Program I’ve Ever Written” . Why is formatting hard? Bob says:

If every statement fit within the column limit of the page, yup. It’s a piece of cake. (I think that’s what gofmt does.) But our formatter also keeps your code within the line length limit. That means adding line breaks (or “splits” as the formatter calls them), and determining the best place to add those is famously hard… The search space we have to cover is exponentially large, and even ranking different solutions is a subtle problem.

Column limits make it essentially hard. More on that later.

Dumbindent

Wuffs doesn’t need a big formatter. It doesn’t need to handle a hundred different hand-written C/C++ styles, only the C that it automatically generates itself. It only needs a “piece of cake”,low INT, high WIS, gofmt -like solution. The ‘unformatted’ C code, by construction, already has line breaks in sensible places. The formatter only needs to fix up the horizontal formatting (i.e. indentation) due to nested {} braces and () parentheses.

Hence dumbindent was born, a dumb-but-fast formatter for C (and C-like) programs. Just like building Lego, when building software, sometimes the dumb approach can solve the problem , beautiful in its own way.

dumbindent has no column limit. It will not break a long line into two medium ones nor merge two short lines into one medium one. It takes existing lines as they are, and only shifts them left or right. The implementation is under 0.6k lines of Go code. Here’s an animation of how it works on some nonsensical, slightly-badly formatted input:

Dumbindent: When 93% of the Time was Spent in Clang-Format

If you want to linger on individual animation frames:

Replacing “run it through clang-format ” with “run it through dumbindent ” made Wuffs’ generated C code slightly less ‘pretty’, but doing so was notably faster and formatting time dropped from 81% to something negligible. Subjectively, Wuffs’ edit-compile-run cycle felt snappier and happier.

A Command-Line Tool

Another data point takes Wuffs’ amalgamated file (here, the 12k lines of C code from an older but unchanging Wuffs release), and formats it again. On this task dumbindent was 70 times faster than clang-format-5.0 , and 80 times faster than clang-format-9 :

$ wc release/c/wuffs-v0.2.c
 11858  35980 431885 release/c/wuffs-v0.2.c

$ time dumbindent                               < release/c/wuffs-v0.2.c > /dev/null
real    0m0.008s
user    0m0.005s
sys     0m0.005s

$ time clang-format-9                           < release/c/wuffs-v0.2.c > /dev/null
real    0m0.668s
user    0m0.618s
sys     0m0.032s

$ time clang-format-9 -style='{ColumnLimit: 0}' < release/c/wuffs-v0.2.c > /dev/null
real    0m0.641s
user    0m0.585s
sys     0m0.037s

Giving clang-format-9 no column limit at all helped a little, but not a lot. I don’t know exactly what clang-format-9 was doing with its time, but perhaps whatever it was isn’t essentially hard, only accidentally hard (and therefore possibly fixable?). That investigation is for another time, though, and most probably for another person.

If you want to try the dumbindent command-line tool yourself, after installing Go , it should suffice to run:

$ go get github.com/google/wuffs/cmd/dumbindent

SQLite

Trying a similar comparison on SQLite’s amalgamated C file (230k lines of C code) was even more dramatic :

$ wc sqlite-amalgamation-3320200/sqlite3.c
 229616 1049648 8115947 sqlite-amalgamation-3320200/sqlite3.c

$ time dumbindent     < sqlite-amalgamation-3320200/sqlite3.c > /dev/null
real    0m0.137s
user    0m0.075s
sys     0m0.034s

$ time clang-format-9 < sqlite-amalgamation-3320200/sqlite3.c > /dev/null
LLVM ERROR: out of memory
Stack dump:
0.      Program arguments: clang-format-9
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x1f)[0x7ff55208135f]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm3sys17RunSignalHandlersEv+0x50)[0x7ff55207f780]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(+0xa38761)[0x7ff552081761]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7ff55143c890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7ff54e2f3e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7ff54e2f5801]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm22report_bad_alloc_errorEPKcb+0x93)[0x7ff551fe60a3]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN4llvm23SmallVectorTemplateBaseIN5clang6format13UnwrappedLineELb0EE4growEm+0xbe)[0x7ff550e007be]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13TokenAnalyzer20consumeUnwrappedLineERKNS0_13UnwrappedLineE+0x121)[0x7ff550dffb21]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format19UnwrappedLineParser5parseEv+0x119)[0x7ff550e10089]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13TokenAnalyzer7processEv+0xcf)[0x7ff550dfee2f]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13guessLanguageEN4llvm9StringRefES2_+0x2d8)[0x7ff550de2728]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format8getStyleEN4llvm9StringRefES2_S2_S2_PNS1_3vfs10FileSystemE+0x59)[0x7ff550de2859]
clang-format-9[0x4064ef]
clang-format-9[0x405688]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7ff54e2d6b97]
clang-format-9[0x40509a]
Aborted (core dumped)
real    0m25.782s
user    0m18.505s
sys     0m3.049s

Caveats

dumbindent does not solve all the problems that clang-format or other formatters do. It does not parse the input as C/C++ source code.

In particular, it does not solve C++’s most vexing parse or otherwise determine whether "x*y" is a multiplication or a type definition (where y is a pointer-to- x typed variable or function argument, such as "int*p" ). For a type definition, where other formatting algorithms would re-write around the "*" as either "x* y" or "x *y" , dumbindent will not insert spaces.

Similarly, dumbindent will not correct this mis-indentation:

if (condition)
  goto fail;
  goto fail;

Instead, when automatically or manually generating the input for dumbindent , it is recommended to always emit {} curly braces, even for what would otherwise be ‘one-liner’ if statements.

Having said that, dumbindent is available as the previously mentioned command-line tool and as a Go package . If it works for you, great. If it doesn’t work for you, don’t use it. :-)

On Related Work

This blog post is critical of other software, especially clang-format . To be clear, Rust, Clang, LLVM, Z3, etc. are not bad technologies. They are great technologies that solve real and important problems, and have orders of magnitude more users that Wuffs and dumbindent do. They’re also solving similar-but-different problems in different contexts and coming from different histories. Software is not a zero-sum game. Engineering is trade-offs.

Published: 2020-06-15

以上就是本文的全部内容，希望本文的内容对大家的学习或者工作能带来一定的帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

Dumbindent: When 93% of the Time was Spent in Clang-Format

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

大学的终结

[美] 凯文·凯里（Kevin Carey） / 朱志勇、韩倩 / 人民邮电出版社 / 2017-2-28 / 59.00

你了解目前全球高等教育的现状吗？你知道高等教育的未来是什么样的吗？你听说过泛在大学吗？翻开本书，了解大学的过去、现在与未来。《大学的终结：泛在大学与高等教育革命》一书由美国著名教育作家凯文? 凯里倾情打造。作者在书中详细论述了美国大学的历史变迁、大学的本质、大学的未来、信息技术与教育的关系、泛在大学的定义、传统大学在大趋势下的挣扎，以及未来高等教育的学历认证与呈现形式。本书作者用缜密的逻辑......一起来看看《大学的终结》这本书的介绍吧!

码农工具

Dumbindent: When 93% of the Time was Spent in Clang-Format

Generating C Code

Compilation Times

Formatting Is Hard

Dumbindent

A Command-Line Tool

SQLite

Caveats

On Related Work

大学的终结

SHA 加密

RGB HSV 转换

RGB CMYK 转换工具