A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix

栏目: IT技术 · 发布时间: 4年前

内容简介：ThisThe Boyer-Moore algorithm (published 1977) relies on a "delta2" table, but that paper didn't explain how to generate the table. Another paper by Knuth, Morris, and Pratt (also 1977) provided the algorithm for the delta2 table. Actually, it provided two

This fixes #713 , silent bad codegen in an important C++17 feature.

The Boyer-Moore algorithm (published 1977) relies on a "delta2" table, but that paper didn't explain how to generate the table. Another paper by Knuth, Morris, and Pratt (also 1977) provided the algorithm for the delta2 table. Actually, it provided two algorithms producing tables compatible with the usage of delta2: a basic algorithm dd and an improved algorithm dd' . We implemented the latter, and properly translated it from 1-based indexing to 0-based indexing.

However, the published algorithm for dd' was incorrect! This was discovered and fixed by Rytter in 1980, which we weren't aware of until we received a bug report. While the "Rytter correction" was known in the computer science literature, I find it very curious that it isn't constantly mentioned when explaining Boyer-Moore (e.g. Wikipedia's page currently makes no mention of this).

This PR applies this 40-year-old bugfix and significantly expands our test coverage.

<functional>
- Add top-level to . (This is the output array; I kept the name.)
- Rename to to follow the naming in the published algorithms.
- Fix comment typo: doesn't exist, it's .
- Add comments about the history for future programmer-archaeologists.
- Rename to , again following the papers. Note that in the usage below, there is a semantic change: stored 0-based values, while stores 1-based values. This undoes a micro-optimization ( avoided unnecessarily translating between the 0-based and 1-based domains), but with the increased usage of in the Rytter correction, I wanted greater correspondence with the published algorithms in order to verify the implementation.
- can be top-level (we don't reassign/reset the ).
- Obscure bugfix: uses which uses , so we should use to match, not . (This makes a difference only for pathologically fancy types.)
- Change 0-based to 1-based .
- Change 0-based to 1-based .
- Rename 1-based to 1-based . (While the code was correct, I found it confusing that was 0-based in other loops but 1-based in this loop.)
- Note that after these changes, the code closely corresponds to the published algorithms, except that subscripting needs to adjust from 1-based to 0-based indexing.
- Implement the Rytter correction, which replaces the final loop of the KMP77 algorithm.
- For clarity and debug codegen, I extracted a repeated array access into after verifying that this doesn't disturb the algorithm.
P0220R1_searchers/test.cpp
- Add test cases to from Rytter's paper, and other repetitive patterns where the Rytter correction produces different results. Those patterns were "made from scratch" but for the results, I just used the output of the implementation and manually verified selected answers for the AB and ABC categories against my understanding of delta2's meaning (including the unused last entry, see #714 ; I was able to understand why it's 10 for , 2 for , and why it should be 1 for and ).
- Add the test case from #713 , plus hand-selected test cases from the randomized testing below (selected for interesting-looking patterns).
- Add randomized test coverage for both Boyer-Moore and Boyer-Moore-Horspool (the latter is not known to have any bugs). This uses a fully-seeded (for speed, instead of directly using ). It prints out the needle/haystack for any failures, so we don't need the seed printing/reload machinery from . (I'm using for 32-bit performance, since we don't need 64-bit values.) The randomized coverage uses alphabets from to ; the former finds more examples of the bug being fixed here (as it's more likely to create highly repetitive patterns). Expanding the alphabet makes repetition unlikely, which is why I stop at . The test does a few things to improve non-optimized debug performance (reusing and to avoid repeated allocations, using to avoid iterator overhead), although I didn't pursue this to the ultimate limit (e.g. is somewhat costly and could be avoided at the expense of some bias). Finally, as with the other randomized coverage, I print out timing statistics if it takes a long time; on my i7-8700 it's tuned to take ~400 ms which seems reasonable (given the number of configurations, the performance of VMs, and the time needed for compilation).

While this changes the behavior of a function, it is ABI-safe. This doesn't require coordinated changes across functions - the rest of the Boyer-Moore machinery is unchanged, and the layout of the delta2 table is unchanged. We're simply filling it with different values. In the event of mismatch, the linker will either pick the correct or incorrect algorithm, so this can't make things any worse.

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

企业应用架构模式

Martin Fowler、王怀民、周斌 / 王怀民、周斌 / 机械工业出版社 / 2004-7 / 49.00元

本书作者是当今面向对象软件开发的权威，他在一组专家级合作者的帮助下，将40多种经常出现的解决方案转化成模式，最终写成这本能够应用于任何一种企业应用平台的、关于解决方案的、不可或缺的手册。本书获得了2003年度美国软件开发杂志图书类的生产效率奖和读者选择奖。本书分为两大部分。第一部分是关于如何开发企业应用的简单介绍。第二部分是本书的主体，是关于模式的详细参考手册，每个模式都给出使用方法和实现信息，并一起来看看《企业应用架构模式》这本书的介绍吧!

码农工具

A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix

企业应用架构模式

CSS 压缩/解压工具

随机密码生成器

MD5 加密