如何Python字典哈希查找工作?字典、工作、Python

2023-09-11 03:08:15 作者:陌上柳絮傾城雪﹏

如何Python字典查找算法内部工作?

How do Python dictionary lookup algorithms work internally?

mydi['foo'] 

如果该字典有百万而言,是一个树搜索执行?我希望在关键的字符串,或字典的大小,长度方面的性能?也许馅一切都变成一本字典是一样好写树搜索索引规模500万串?

If the dictionary has 1,000,000 terms, is a tree search executed? Would I expect performance in terms of the length of the key string, or the size of the dictionary? Maybe stuffing everything into a dictionary is just as good as writing a tree search index for strings of size 5 million?

推荐答案

下面是一些伪code更接近实际发生的情况。想象一下字典有一个数据包含键属性,值对和尺寸这是分配单元的数目。

Here's some pseudo-code closer to what actually happens. Imagine the dictionary has a data attribute containing the key,value pairs and a size which is the number of cells allocated.

def lookup(d, key):
    perturb = j = hash(key)
    while True:
        cell = d.data[j % d.size]
        if cell.key is EMPTY:
            raise IndexError
        if cell.key is not DELETED and (cell.key is key or cell.key == key):
            return cell.value
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB

扰乱价值保证的散列code的所有位解决哈希冲突时,最终被使用,但一旦它已经退化到0 (5 * j)条+1 将最终触及表中的所有单元格。

The perturb value ensures that all bits of the hash code are eventually used when resolving hash clashes but once it has degraded to 0 the (5*j)+1 will eventually touch all cells in the table.

尺寸总是比实际使用这样的哈希确保最终能够打空单元格时,该键不存在细胞的数量大得多(而且通常应当打一pretty的快)。还有一个已删除的值要键表示这不应该终止搜索的小区但不是当前正在使用。

size is always much larger than the number of cells actually used so the hash is guaranteed to eventually hit an empty cell when the key doesn't exist (and should normally hit one pretty quickly). There's also a deleted value for a key to indicate a cell which shouldn't terminate the search but which isn't currently in use.

至于你的问题关于密钥字符串的长度,散列串会考虑所有的字符串中的字符,而是一个字符串,也有用于存储计算哈希值的字段。所以,如果你使用不同的字符串做查找每次字符串长度可能有一个轴承,但如果你有一个固定的按键和再使用相同字符串的哈希将不会被重新计算第一次使用后, 。 Python中会从这样的好处,因为大多数名称查找包含字典和每个变量的一个副本或属性名称在内部存储,所以每次访问的属性时间 XY 有一个字典查找,但没有一个调用的哈希函数。

As for your question about the length of the key string, hashing a string will look at all of the characters in the string, but a string also has a field used to store the computed hash. So if you use different strings every time to do the lookup the string length may have a bearing, but if you have a fixed set of keys and re-use the same strings the hash won't be recalculated after the first time it is used. Python gets a benefit from this as most name lookups involve dictionaries and a single copy of each variable or attribute name is stored internally, so every time you access an attribute x.y there is a dictionary lookup but not a call to a hash function.