In my experience, when students learn data structures they (1) learn how they work theoretically (2) use off-the-shelf data structures to solve problems and (3) sometimes implement these data structures from scratch. However, I don’t often see classes exploring the off-the-shelf implementations, and the interesting practical considerations that come along with them.
When I was a TA for 6.006, I had students answer the following questions about Python dictionaries. I hope they were a fun way to experiment with hash tables “in the wild”; I definitely learned a lot while preparing this! Solutions are at the end.
Background
Recall that Python implements dictionaries using a hash table with open addressing. Long story short, dictionaries have an underlying array that stores key-value pairs. When inserting a key, Python uses hash values to “probe” through this array in a specific sequence until it finds an open slot for that key. This post assumes you have a basic understanding of open addressing. If you’d like a refresher on hash table concepts, this might help.
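To make the probing idea concrete, here is a minimal toy sketch of open addressing with linear probing (my own illustration; the ToyDict name is made up, and CPython’s real probe sequence and memory layout are more sophisticated, as we’ll see in Problem 4’s solution):

# Toy open-addressing table with linear probing (illustration only).
# Resizing is omitted, so the table can fill up; CPython also uses a
# perturbed probe sequence rather than simply walking forward.
class ToyDict:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _probe(self, key):
        # Start at hash(key) mod capacity and walk forward until we find
        # this key or an empty slot.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def __setitem__(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        slot = self.slots[self._probe(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]

t = ToyDict()
t["spam"] = 1
print(t["spam"])  # 1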
Also, the internals of a Python dictionary aren’t a mystery: you can check out the source code! If you don’t want to read through everything, there are a bunch of resources explaining Python dictionaries, like this and this.
Problems
Problem 1
Use a Python program to determine how the size of a dictionary, in bytes, changes as you insert keys. How does this affect insertion time, and why?
Problem 2
Assume you have inserted \(e\) entries into a hash table that has the capacity to store \(c\) entries. Define the load factor of the hash table as \(e/c\). In addition, define the compactness factor of the hash table as (number of bytes allocated to the \(e\) entries) divided by (number of bytes allocated to the entire table).
Recall that the size of a hash table periodically increases to keep the load factor below a certain threshold. What is the maximum compactness factor of a Python dictionary, and how does that compare to the maximum load factor? How has this changed over different versions of Python, and why? It might help to read the CPython source.
Problem 3
Consider object A, which returns a random integer in the range \([1, n]\) on every call to __hash__. What’s so bad about this? Provide two examples of problematic code using A objects and dictionaries, and explain what goes wrong.
import random

class A:
    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return random.randint(1, self.n)
Problem 4
Object B differs from A in that we return a deterministic hash value in the range \([1, n]\). For various values of \(n\) and \(k\), compute how long it takes to insert \(k\) B(n) objects into a dictionary. Does this match what you theoretically expect? How does this compare to the “real world,” and what explains the difference?
import random

class B:
    def __init__(self, n):
        self.hash = random.randint(1, n)

    def __hash__(self):
        return self.hash
Solutions
Here are brief responses to these questions. Unless otherwise stated, I’m running these on my MacBook Pro using Python 3.7.7.
Problem 1
First, we can use the following code to measure when the size of a Python dictionary changes:
import sys

d = {}
N = 1000
last_size = sys.getsizeof(d)
for i in range(N):
    d[i] = "foo"
    if sys.getsizeof(d) > last_size:
        print("key", i, "new size", sys.getsizeof(d))
        last_size = sys.getsizeof(d)
This prints out the following, indicating that the table size doubles on certain insertions (it isn’t exact because dictionaries have a 96-byte overhead).
key 5 new size 376
key 10 new size 656
key 21 new size 1192
key 42 new size 2288
key 85 new size 4712
key 170 new size 9328
key 341 new size 18536
key 682 new size 36976
We can also measure insertion time: here, we add \(N\) values to a dictionary, recording how long it takes to add each key. We add up the times over multiple trials to reduce noise:
import time

N = 1000
# times[i] stores how long it takes to add i to the dictionary
times = [0] * N

# Run multiple trials to reduce noise
for _ in range(10000):
    d = {}
    # For each i, measure how long it takes to add i
    # and add the result to times[i]
    for i in range(N):
        start_time = time.time()
        d[i] = "foo"
        t = time.time() - start_time
        times[i] += t

# Find the 5 values of i that took the longest time to insert
top = sorted(list(range(N)), key=lambda i: -times[i])
for i in top[:5]:
    print(i, times[i])
This returns the following:
682 0.027842044830322266
341 0.015052318572998047
170 0.0094757080078125
85 0.006536006927490234
42 0.0048770904541015625
Unsurprisingly, these correspond to the insertions on which the dictionary resizes! On those insertions, Python has to allocate a new array and reinsert every existing key into it, which takes time linear in the number of keys. That occasional linear-time work is why insertion into a Python dictionary is only amortized (rather than worst-case) \(\Theta(1)\) in expectation.
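As a back-of-the-envelope check (my own sketch, assuming for simplicity that the table doubles each time it fills): while inserting \(n\) keys, the work spent on resizes is proportional to

\[ 8 + 16 + 32 + \dots + 2^{\lceil \log_2 n \rceil} < 4n, \]

so the total resizing cost is \(O(n)\), and each insertion pays only \(O(1)\) of it on average. Python’s actual growth factor and \(2/3\) load threshold differ from plain doubling, but the geometric-series argument is the same.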
Problem 2
First, looking at dict-common.h and pyport.h, we can deduce that the size of a dictionary entry is 24 bytes. Therefore, given a dictionary d, its compactness factor is equal to len(d) * 24 / sys.getsizeof(d). This ratio is maximized right before the dictionary resizes. To measure this, we can modify our script from Problem 1:
import sys

d = {}
N = 100000
last_size = sys.getsizeof(d)
for i in range(N):
    compactness = len(d) * 24.0 / sys.getsizeof(d)
    d[i] = "foo"
    if sys.getsizeof(d) > last_size:
        print("key", i, "compactness", compactness)
        last_size = sys.getsizeof(d)
On Python 2.7, this prints the following:
key 5 compactness 0.428571428571
key 21 compactness 0.480916030534
key 85 compactness 0.608591885442
key 341 compactness 0.651177593889
key 1365 compactness 0.66272859686
key 5461 compactness 0.665677948885
key 21845 compactness 0.666419223299
key 87381 compactness 0.666604789308
The compactness factor clearly approaches \(2/3\), which matches the load factor stated in dictobject.c. (Also, note that different versions of Python have different table resizing schemes.)
On Python 3.7, this prints the following:
key 5 compactness 0.4838709677419355
key 10 compactness 0.6382978723404256
key 21 compactness 0.7682926829268293
key 42 compactness 0.8456375838926175
key 85 compactness 0.8916083916083916
key 170 compactness 0.865874363327674
key 341 compactness 0.8773584905660378
key 682 compactness 0.8830384117393181
key 1365 compactness 0.8859800951968845
key 2730 compactness 0.8874200888503629
key 5461 compactness 0.888160034695869
key 10922 compactness 0.8885213005396317
key 21845 compactness 0.8887065715603049
key 43690 compactness 0.7999243224109415
key 87381 compactness 0.7999627701453185
Weird. The compactness factor approaches \(8/9\), but then suddenly shifts down to \(4/5\). In fact, the maximum load factor is still \(2/3\), but Python dictionaries have gotten more compact.
So what’s changed? In the old implementation, dictionaries stored a single array of 24-byte entries. If a dictionary has capacity \(c\) and a maximum load factor of \(2/3\), then its maximum compactness factor is \((24 \cdot \frac{2}{3} c)/(24 \cdot c) = 2/3\).
This isn’t very efficient: if a third of the entries are always going to be unoccupied, we’re wasting 24 bytes on each of them. Python 3.6 solves this by storing an array of \(c\) indices, which it uses to access a separate array of \(\frac{2}{3} c\) 24-byte entries. When inserting a new key, Python (1) adds the key-value pair to the entries array, (2) probes through the indices array until it finds an open slot, and (3) stores the index of the new entry there.
The size of these indices depends on the value of \(c\): 1 byte if \(c < 2^8 = 256\), 2 bytes if \(c < 2^{16} = 65,536\), and so on. So for a dictionary with capacity \(2^8 \le c < 2^{16}\), the maximum compactness factor will be \((24 \cdot \frac{2}{3} c) / (24 \cdot \frac{2}{3} c + 2 \cdot c) = 8/9\), which matches our observations. Once the capacity reaches \(2^{16}\), each index grows to 4 bytes and the maximum compactness factor drops to \((24 \cdot \frac{2}{3} c) / (24 \cdot \frac{2}{3} c + 4 \cdot c) = 4/5\), which explains the shift we saw around key 43690. Either way, this is a pretty sizable improvement over the old \(2/3\)!
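To sanity-check these numbers, here is a small sketch (my own, built from the formulas above) that predicts the maximum compactness factor for a given capacity, assuming 24-byte entries, a \(2/3\) load threshold, and index widths of 1, 2, 4, or 8 bytes:

# Predicted maximum compactness under the compact-dict model described
# above. This is an approximation: it ignores the fixed per-dict overhead
# (96 bytes, per Problem 1), which matters most for small tables.
def predicted_max_compactness(c):
    if c < 2**8:
        index_bytes = 1
    elif c < 2**16:
        index_bytes = 2
    elif c < 2**32:
        index_bytes = 4
    else:
        index_bytes = 8
    entry_bytes = 24 * (2 * c // 3)      # bytes spent on occupied entries
    return entry_bytes / (entry_bytes + index_bytes * c)

for c in [2**7, 2**12, 2**20]:
    print(c, predicted_max_compactness(c))

For large capacities this approaches the \(8/9\) and \(4/5\) figures above; for small tables the fixed overhead pulls the measured value a bit below the prediction.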
Problem 3
In this example, setting d[key] and then retrieving d[key] usually results in a KeyError: if the hash value changes between the two operations, Python will be unable to locate the key.
d = {} key = A(5) d[key] = "foo" print(d[key])
Here, setting the value of d[key] a second time should overwrite the first entry. If the hash value changes, however, Python will insert key a second time and this program will print 2.
d = {}
key = A(5)
d[key] = "foo"
d[key] = "bar"
print(len(d))
Moral of the story: hash functions should be deterministic.
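For contrast, here is a minimal sketch (my own illustration, not from the original problem set) of the usual fix: derive __hash__ and __eq__ from the same immutable attribute, so equal objects always hash identically:

class C:
    def __init__(self, n):
        self.n = n

    def __hash__(self):
        # Deterministic: the same value always produces the same hash.
        return hash(self.n)

    def __eq__(self, other):
        return isinstance(other, C) and self.n == other.n

d = {}
d[C(5)] = "foo"
print(d[C(5)])   # "foo": a fresh-but-equal key is found
d[C(5)] = "bar"
print(len(d))    # 1: the second assignment overwrites the first entry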
Problem 4
If we insert \(k\) keys into a dictionary with \(n\) possible hash values, we expect \(k/n\) keys to be assigned to each hash value. For each hash value, inserting those \(k/n\) keys should take \(1 + 2 + \dots + k/n = \Theta(k^2/n^2)\) time, because every time we add a new key we must probe past all existing keys with the same hash value before finding an open slot. Summing over all \(n\) hash values, inserting all \(k\) keys should take \(\Theta(k^2/n)\) time in expectation. (Probe sequences can overlap, so this analysis isn’t 100% accurate; we’ll get to that later!)
We can verify this using the following code, which adds \(k\) B(n) objects to a table and measures how long it takes:
import time

def create_dict(n, k):
    d = {}
    start_time = time.time()
    for i in range(k):
        d[B(n)] = i
    return time.time() - start_time
First, for a few values of \(n\) we can see how insertion time varies with \(k\). Indeed, a regression will verify that insertion time increases quadratically with \(k\).
Second, for a fixed value of \(k = 2000\) we can see how insertion time varies with \(n\). It shouldn’t be hard to verify that time is proportional to \(1/n\).
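A driver along these lines (my own sketch; it assumes the B class and create_dict from above are already defined) is enough to produce the data for both checks:

# Vary k for a few fixed n, then vary n for fixed k = 2000.
for n in [1, 10, 100]:
    for k in [500, 1000, 2000, 4000]:
        print("n =", n, "k =", k, "time =", create_dict(n, k))

for n in [1, 2, 4, 8, 16, 32]:
    print("k = 2000", "n =", n, "time =", create_dict(n, 2000))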
Cool, these are consistent with our predictions!
So why doesn’t Python run into this slowdown? In short, hash collisions rarely happen in practice. Python also makes sure to use every bit of the hash value, as opposed to naively taking hash values mod the length of the array (see the perturb logic in dictobject.c).
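As a rough illustration (my own paraphrase of the comments in dictobject.c, not a line-for-line port), the probe sequence gradually mixes the higher-order bits of the hash into the index:

# Simplified sketch of CPython's probing recurrence: "perturb" starts as
# the full hash value, and its high bits are shifted into the index over
# successive probes, so keys that collide in the low bits usually separate
# after a few steps.
PERTURB_SHIFT = 5

def probe_sequence(h, mask, num_probes):
    perturb = h
    i = h & mask
    for _ in range(num_probes):
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask

print(list(probe_sequence(123456789, mask=7, num_probes=5)))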
In our example, there were so many hash collisions that insertion time was dominated by two keys having the exact same probe sequence. In the real world, however, insertion time is dominated by different probe sequences overlapping. It isn’t trivial, but one can prove that inserting \(k\) keys into a dictionary takes \(\Theta(k)\) time in expectation.