Summary: This article was originally a topic in our "Study Group" (研习小组). I spent a fair amount of effort on that topic at the time and got something out of it, so I am recording it here for future reference.
Main Text
Recently I read an article by my classmate Chen that mentioned ThreadLocal. I had never read its source code before, so I took the opportunity to look at it, and I found that the ThreadLocalMap inside ThreadLocal uses a hash function said to be a form of Fibonacci hashing (tentative). Roughly, it works in two steps:

- Accumulate the constant 0x61c88647 for each successive hash code. (This corresponds to the first step of the general hash-function procedure: computing the hash.)
```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// The constant can be computed as ~(2^32 / φ) + 0b1 (here ^ denotes exponentiation).
// Note: the divisor needs more precision than "1.618"; with only four digits the
// result would come out as 1640475766 instead of 1640531527.
BigDecimal goldenRatio = BigDecimal.valueOf((1 + Math.sqrt(5)) / 2); // 1.618033988749895
BigDecimal goldenRatioResult = BigDecimal.valueOf(Math.pow(2, 32))
        .divide(goldenRatio, 0, RoundingMode.HALF_UP);
int hashIncrement = ~goldenRatioResult.intValue() + 0b1; // two's-complement negation; same as Math.abs() here
// The result is 1640531527, i.e. 0x61c88647 in hexadecimal.
```
- AND the result with (capacity − 1) of the ThreadLocalMap. (This corresponds to the second step of the general procedure: reducing the hash to an index, i.e. taking the value from step one modulo the table capacity. Note also that the capacity of ThreadLocalMap must be a power of two. The reason for this rule is that when computing x % y, if y = 2^n, the modulo can be replaced by the bitwise operation x & (y − 1), which is faster: a modulo is essentially a division, and division performs comparatively poorly.)
```java
/**
 * The difference between successively generated hash codes - turns
 * implicit sequential thread-local IDs into near-optimally spread
 * multiplicative hash values for power-of-two-sized tables.
 */
private static final int HASH_INCREMENT = 0x61c88647;

/**
 * Returns the next hash code.
 */
private static int nextHashCode() {
    return nextHashCode.getAndAdd(HASH_INCREMENT);
}
```
Note that 1.618 is the golden ratio (the golden ratio can also be given as 0.618; 0.618 and 1.618 are reciprocals of each other, i.e. 1/1.618 = 0.618), and the golden ratio in turn can be derived from the Fibonacci sequence:
Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.

n:    0  1  2  3  4  5  6   7   8   9  10  11   12   13   14   15   16    17    18    19    20
F(n): 0  1  1  2  3  5  8  13  21  34  55  89  144  233  377  610  987  1597  2584  4181  6765

n:       21     22     23     24     25      26      27      28      29      30       31       32       33       34       35        36        37        38        39         40
F(n): 10946  17711  28657  46368  75025  121393  196418  317811  514229  832040  1346269  2178309  3524578  5702887  9227465  14930352  24157817  39088169  63245986  102334155

For example, 102334155 / 63245986 = 1.61803. As the table shows, for larger n, F(n+1) / F(n) is approximately 1.618.
「 Golden ratio 」
「 Fibonacci number 」
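The convergence of F(n+1)/F(n) to the golden ratio is easy to check in code. Below is a minimal sketch (the class and method names are my own):

```java
// Sketch: ratios of consecutive Fibonacci numbers approach the golden ratio φ ≈ 1.618.
public class GoldenRatioDemo {
    // Returns F(n) under the definition F(0) = 0, F(1) = 1.
    static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    public static void main(String[] args) {
        double phi = (1 + Math.sqrt(5)) / 2;        // ≈ 1.6180339887
        double ratio = (double) fib(41) / fib(40);  // 165580141 / 102334155
        System.out.println(ratio + " vs " + phi);   // the two agree to many decimal places
    }
}
```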
This raises some questions:
- Why must the golden ratio be used as the increment? I saw a video claiming that the golden ratio makes the hash values spread out (tentative), but I had a doubt: wouldn't some other constant spread the hashes just as well?
- Also, ThreadLocalMap's hash is finally masked with a bitwise AND that discards the high bits. Why does the result remain well dispersed after the high bits are removed? (Tentative; it may well not remain dispersed at all.)
Concepts and Theory
General classification of hash functions
• Division-based hash functions: h(x) = x % M
• Bit-shifting hash functions
• The middle-square method
• Multiplicative hash functions
Fibonacci hashing is one implementation of the multiplicative hash functions.
In the multiplicative method, the multiplier A must be carefully chosen to scatter the numbers in the table. A very good choice for the constant A is φ⁻¹W, where φ, known as the golden ratio, is the positive root of the polynomial

x² − x − 1

This is because x is a solution to the polynomial if and only if

x² − x − 1 = 0  iff  x² − x = 1  iff  x − 1 = 1/x

In other words, a solution x has the property that its inverse is x − 1. The solution, φ, is called the golden ratio because it arises in nature in an astonishing number of places and because the ancient Greeks used this ratio in the design of many of their buildings, considering it a divine proportion. Thus, φ⁻¹ = φ − 1. φ is also the value to which fₙ/fₙ₋₁ converges as n → ∞, where fₙ is the nth Fibonacci number. Since φ⁻¹ = φ − 1, it is approximately 0.6180339887498948482045868343656. (Well, "approximately" depends on your notion of approximation, doesn't it?)

When we let A = φ⁻¹W as the multiplicative constant, the multiplicative method is called Fibonacci hashing. The constant has many remarkable and astonishing mathematical properties, but the property that makes it a good factor in the above hash function is the following fact. First observe that the sequence of values {φ⁻¹}, {2φ⁻¹}, {3φ⁻¹}, … lies entirely in the interval (0, 1). Remember that the curly braces mean "the fractional part of", so for example 2φ⁻¹ ≈ 1.236067977 and {2φ⁻¹} ≈ 0.236067977. The first value, φ⁻¹, divides the interval (0, 1) into two segments whose lengths are in the golden ratio.
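To make the quoted description concrete, here is a sketch of Fibonacci hashing for a power-of-two table. This is my own illustration rather than code from any particular library; 0x9E3779B9 is the 32-bit integer closest to 2^32 · φ⁻¹, and ThreadLocalMap's 0x61c88647 is exactly its negation modulo 2^32 (0x61c88647 + 0x9E3779B9 = 2^32):

```java
// Sketch of Fibonacci hashing: multiply by A ≈ 2^32 / φ, keep the top k bits.
public class FibonacciHash {
    // The 32-bit value closest to 2^32 * 0.6180339887... (the inverse golden ratio).
    static final long A = 0x9E3779B9L; // 2654435769

    // Maps key x into a table of size 2^k: the top k bits of (x * A) mod 2^32,
    // i.e. floor(2^k * frac(x / φ)).
    static int fibHash(int x, int k) {
        long product = (x * A) & 0xFFFFFFFFL; // multiply and keep the low 32 bits (mod 2^32)
        return (int) (product >>> (32 - k));  // keep the top k of those 32 bits
    }

    public static void main(String[] args) {
        // Consecutive keys are scattered across a 16-slot table rather than clustered:
        for (int x = 1; x <= 5; x++) {
            System.out.println(x + " -> " + fibHash(x, 4));
        }
    }
}
```

Note the contrast with ThreadLocalMap, which masks off the low bits instead of keeping the high ones; it can afford to, because its "keys" are already the accumulated multiples of 0x61c88647.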
General steps of a hash function
The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:

index = f(key, array_size)

Often this is done in two steps:

hash = hashfunc(key)
index = hash % array_size

In this method, the hash is reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%).

In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function. [5]
「 Hash table 」
Modulo operations might be implemented such that a division with a remainder is calculated each time. For special cases, on some hardware, faster alternatives exist. For example, the modulo of powers of 2 can alternatively be expressed as a bitwise AND operation:
x % 2^n == x & (2^n - 1)

Examples (assuming x is a positive integer):

x % 2 == x & 1
x % 4 == x & 3
x % 8 == x & 7
In devices and software that implement bitwise operations more efficiently than modulo, these alternative forms can result in faster calculations. [11](https://en.wikipedia.org/wiki/Modulo_operation#cite_note-11) [Optimizing](https://en.wikipedia.org/wiki/Compiler_optimization) [compilers](https://en.wikipedia.org/wiki/Compiler) may recognize expressions of the form expression % constant where constant is a power of two and automatically implement them as expression & (constant - 1). This can allow writing clearer code without compromising performance. This optimization is not possible for languages in which the result of the modulo operation has the sign of the dividend (including C), unless the dividend is of an [unsigned](https://en.wikipedia.org/wiki/Signedness) integer type. This is because, if the dividend is negative, the modulo will be negative, whereas expression & (constant - 1) will always be positive.
「 Modulo operation 」
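The identity is easy to try out in Java (my own snippet). As the quoted text warns for C, Java's % also takes the sign of the dividend, so the equivalence only holds for non-negative values:

```java
// Sketch: for non-negative x, x % 2^n == x & (2^n - 1); for negative x they differ.
public class MaskDemo {
    public static void main(String[] args) {
        int x = 1640531527;          // any non-negative value (here, 0x61c88647)
        System.out.println(x % 16);  // remainder by a power of two: 7
        System.out.println(x & 15);  // the same result via bitwise AND: 7
        System.out.println(-7 % 16); // in Java the sign follows the dividend, so -7
        System.out.println(-7 & 15); // the mask gives 9 instead, so the two differ
    }
}
```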
What makes a good hash function
Properties of a Good Hash Function

To hash means to chop up or make a mess of things, like hashed potatoes. A hash function is supposed to chop up its argument and reconstruct a value out of the chopped up little pieces. Good hash functions:
• make the original value intractably hard to reconstruct from the computed hash value,
• are easy to compute (for speed), and
• randomly disperse keys evenly throughout the table, making sure that no two keys map to the same index.
Choosing a hash function
A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve. [6]
A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions. [7] [8]
The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number. [9] The modulus operation may provide some additional mixing; this is especially useful with a poor hash function.

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash [3] is claimed to have particularly poor clustering behavior. [9]
A good hash function should achieve a "discrete uniform distribution". Fibonacci hashing achieves a very good discrete uniform distribution, while picking some other value may not (though not necessarily cannot); this can be observed in an experiment. The figure below shows a demo I wrote; the red dashed box contains Fibonacci hashing, and the others use different constants for comparison (distribution over the interval 0–1):

The following shows the distribution produced by ThreadLocalMap's own hash function. (You can see that with 1.618 the distribution is discretely uniform; other values can achieve some degree of discrete uniformity, but some values cannot.) (Tentative.)
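The effect can also be checked without plotting. Below is a minimal sketch (my own code, not ThreadLocalMap's) that counts how many distinct slots a run of sequential hash codes occupies in a power-of-two table:

```java
// Sketch: count the distinct slots hit by the first TABLE_SIZE accumulated
// hash codes, masked the same way ThreadLocalMap masks them.
public class SpreadDemo {
    static final int TABLE_SIZE = 16; // a power of two, like ThreadLocalMap's initial capacity

    static int distinctSlots(int increment) {
        java.util.Set<Integer> slots = new java.util.HashSet<>();
        int h = 0;
        for (int i = 0; i < TABLE_SIZE; i++) {
            slots.add(h & (TABLE_SIZE - 1)); // keep only the low bits
            h += increment;                  // overflow is harmless: we work mod 2^32 anyway
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctSlots(0x61c88647)); // odd increment: all 16 slots used
        System.out.println(distinctSlots(0x61c88646)); // even increment: only 8 slots used
    }
}
```

Interestingly, any odd increment fills every slot of this small table, which partly answers the earlier question: the golden-ratio constant is not unique in avoiding collisions here, but it is claimed to additionally keep consecutive keys far apart across all power-of-two table sizes.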
Re: Golden Ratio & Hash Tables
Posted: Nov 2, 2002 2:58 AM
“Alexei A. Frounze” alexfru@chat.ru wrote in message news:apop0i$3ib1k$2@ID-57378.news.dfncis.de …
Can anybody enlighten me why golden ratio is so widely used when doing hash tables? Is it the only possible and effective value? I've not found anything about this particular choice in Sedgewick's book, just the fact that it's used. Any simple explanation? I believe Knuth has an explanation, but I don't have the book handy and don't know when I'll be able to read it. Can you explain why (sqrt(5)-1)/2, but not, let's say, (sqrt(3)-1)/2? Does this really have anything to do with Fibonacci numbers, packing of seeds and the like? TIA
The usual explanation is that it provides the best dispersion of
consecutive keys in a multiplicative hash table: that was Knuth’s
reason for advocating it.
For a hash table of size 2^N, an index for each key K can be computed as K * M, where M is a prime number close to 2^N * R, R being the inverse golden ratio (sqrt(5)-1)/2. Keep the N most significant bits as the hash index. Knuth's finding was that the dispersion of indexes for a sequence of consecutive keys is maximized when M is chosen this way, thus a multiplicative hash table with a dense set of keys will have the fewest possible collisions when M approx= 2^N * R.
–
Chris Green
「 Topic: Golden Ratio & Hash Tables 」
x mod 2^k is not a desirable function because it does not depend on all the bits in the binary representation of x

Similarly, it is often tempting to let M be a power of two, e.g., M = 2^k for some integer k > 1. In this case, the hash function simply extracts the bottom k bits of the binary representation of x. While this hash function is quite easy to compute, it is not a desirable function because it does not depend on all the bits in the binary representation of x.
「 Division Method 」
Collision handling in hash tables
• separate chaining
• open addressing
Open Addressing

In open addressing, there are no separate lists attached to the table. All values are in the table itself. When a collision occurs, the cells of the hash table itself are searched until an empty one is found. Which cells are searched depends upon the specific method of open addressing. All variations can be described generically by a sequence of functions

h_0(x) = (h(x) + f(0, x)) % M
h_1(x) = (h(x) + f(1, x)) % M
…
h_k(x) = (h(x) + f(k, x)) % M

where h_i(x) is the i-th location tested and f(i, x) is a function that returns some integer value based on the values of i and x. The idea is that the hash function h(x) is first used to find a location in the hash table for x. If we are trying to insert x into the table, and the index h(x) is empty, we insert it there. Otherwise we need to search for another place in the table into which we can store x. The function f(i, x), called the collision resolution function, serves that purpose. We search the locations (h(x) + f(0, x)) % M, (h(x) + f(1, x)) % M, (h(x) + f(2, x)) % M, …, (h(x) + f(k, x)) % M until either an empty cell is found or the search returns to a cell previously visited in the sequence. The function f(i, x) need not depend on both i and x. Soon, we will look at a few different collision resolution functions.

To search for an item, the same collision resolution function is used. The hash function is applied to find the first index. If the key is there, the search stops. Otherwise, the table is searched until either the item is found or an empty cell is reached. If an empty cell is reached, it implies that the item is not in the table. This raises a question about how to delete items. If an item is deleted, then there will be no way to tell whether the search should stop when it reaches an empty cell, or just jump over the hole. The way around this problem is to use lazy deletion. In lazy deletion, the cell is marked DELETED. Only when it is needed again is it re-used.
Every cell is marked as either:
ACTIVE: it has an item in it
EMPTY: it has no item in it
DELETED: it had an item that was deleted; it can be re-used

These three constants are supposed to be defined for any hash table implementation in the ANSI standard. There are several different methods of collision resolution using open addressing: linear probing, quadratic probing, double hashing, and hopscotch hashing.
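As a small illustration of open addressing with linear probing (f(i, x) = i, which is also the strategy ThreadLocalMap uses), here is a sketch with my own names; it deliberately omits deletion, resizing, and the full-table case:

```java
// Minimal open-addressing table with linear probing: on a collision,
// step to the next cell (wrapping around) until an empty cell is found.
public class ProbingTable {
    final Integer[] keys;
    final int mask; // capacity - 1; capacity must be a power of two

    ProbingTable(int capacityPowerOfTwo) {
        keys = new Integer[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Inserts a key and returns the slot finally used (assumes the table is not full).
    int insert(int key) {
        int i = key & mask;          // h(x): the initial slot
        while (keys[i] != null) {    // occupied: probe the next cell
            i = (i + 1) & mask;      // h_i(x) = (h(x) + i) % M, with f(i, x) = i
        }
        keys[i] = key;
        return i;
    }

    public static void main(String[] args) {
        ProbingTable t = new ProbingTable(8);
        System.out.println(t.insert(1)); // 1 & 7 == 1: goes to slot 1
        System.out.println(t.insert(9)); // 9 & 7 == 1 collides, so it lands in slot 2
    }
}
```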
References
「 Java中Integer越界的处理 」 (How Java handles Integer overflow) — "take 2^32 divided by 1.618, and accumulate with the result each time": this accumulation very quickly overflows the integer range.
「 关于’二补码’(Two’s Complement) 」 (On two's complement) — converting a negative number to a positive one (or vice versa) uses the two's complement operation: bitwise NOT, then add binary 1.
「 Fibonacci Hashing & Fastest Hashtable 」
「 Hashing, Hash Tables and Scatter Tables 」
「 Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo) 」 — the most recommended read; see also the discussion of it on 「 Hacker News 」.
「 为什么使用0x61c88647 」
「 Why 0x61c88647? 」