Best hashing algorithm in terms of hash collisions and performance for strings

2023-09-10 22:43:18 Author: 栖鸦不定

What would be the best hashing algorithm if we had the following priorities (in that order):

1. Minimal hash collisions
2. Performance

It doesn't have to be secure. Basically I'm trying to create an index based on a combination of properties of some objects. All the properties are strings.

Any references to c# implementations would be appreciated.
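
For illustration only (the question itself contains no code), one common way to build such a composite key from string properties in C# is to combine their hashes with System.HashCode, available from .NET Core 2.1 / .NET Standard 2.1 onwards; the class and property names below are made up:

```csharp
using System;

class Person
{
    public string FirstName { get; set; } = "";
    public string LastName  { get; set; } = "";
    public string City      { get; set; } = "";

    // Combine the string properties into one hash value that can serve
    // as the key of a composite index.
    public override int GetHashCode() =>
        HashCode.Combine(FirstName, LastName, City);
}
```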

Recommended Answer

Forget about the term "best". No matter which hash algorithm anyone might come up with, unless you have a very limited set of data that needs to be hashed, every algorithm that performs very well on average can become completely useless if only being fed with the right (or from your perspective "wrong") data.

Instead of spending too much time thinking about how to make the hash more collision-free without using too much CPU time, I'd rather start thinking about how to make collisions less problematic. E.g. if every hash bucket is in fact a table and all strings in this table (that had a collision) are sorted alphabetically, you can search within a bucket table using binary search (which is only O(log n)). That means that even when every second hash bucket has 4 collisions, your code will still have decent performance (it will be a bit slower compared to a collision-free table, but not by much). One big advantage here is that if your table is big enough and your hash is not too simple, two strings that result in the same hash value will usually look completely different, so the binary search can stop comparing strings after maybe one or two characters on average, making every comparison very fast.
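
To make that idea concrete, here is a minimal C# sketch (not code from the original answer) of a bucket table whose buckets are kept sorted, so lookups inside a bucket use binary search; the class name and the hash function are placeholders:

```csharp
using System;
using System.Collections.Generic;

class SortedBucketSet
{
    private readonly List<string>[] _buckets;

    public SortedBucketSet(int bucketCount) =>
        _buckets = new List<string>[bucketCount];

    private List<string> Bucket(string key)
    {
        // Any reasonably fast string hash will do; GetHashCode is used
        // here purely for illustration.
        int index = (key.GetHashCode() & int.MaxValue) % _buckets.Length;
        return _buckets[index] ??= new List<string>();
    }

    public void Add(string key)
    {
        List<string> bucket = Bucket(key);
        // Keep each bucket sorted so lookups can binary-search it.
        int pos = bucket.BinarySearch(key, StringComparer.Ordinal);
        if (pos < 0)
            bucket.Insert(~pos, key);
    }

    public bool Contains(string key) =>
        // O(log n) within the bucket; colliding strings usually differ
        // early, so each comparison tends to stop after a few characters.
        Bucket(key).BinarySearch(key, StringComparer.Ordinal) >= 0;
}
```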

Actually, I had a situation myself before where searching directly within a sorted table using binary search turned out to be faster than hashing! Even though my hash algorithm was simple, it took quite some time to hash the values. Performance testing showed that hashing only became faster than binary search once I had more than about 700-800 entries. However, as the table could never grow larger than 256 entries anyway, and as the average table was below 10 entries, benchmarking clearly showed that on every system and every CPU the binary search was faster. Here, the fact that comparing the first byte of the data was usually already enough to lead to the next bsearch iteration (since the data typically differed within the first one or two bytes) turned out to be a big advantage.
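
Under the same caveat, here is a sketch of the variant described in that paragraph: no hashing at all, just one sorted array of keys searched with binary search (the data and table sizes of the original benchmark are not reproduced here):

```csharp
using System;

class SortedTable
{
    private readonly string[] _keys;

    public SortedTable(string[] keys)
    {
        _keys = (string[])keys.Clone();
        // The sort order must match the comparer used for lookups.
        Array.Sort(_keys, StringComparer.Ordinal);
    }

    public bool Contains(string key) =>
        // Ordinal comparison stops at the first differing character, so
        // each probe is cheap when the keys diverge early.
        Array.BinarySearch(_keys, key, StringComparer.Ordinal) >= 0;
}
```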

So to summarize: I'd take a decent hash algorithm that doesn't cause too many collisions on average and is rather fast (I'd even accept a few more collisions if it's just very fast!), and rather optimize my code to get the smallest possible performance penalty once collisions do occur (and they will! They will, unless your hash space is at least as big as your data space and you can map a unique hash value to every possible piece of data).
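
The answer doesn't name a specific algorithm; as one example of a string hash that is commonly considered decent and rather fast, here is a 32-bit FNV-1a sketch in C# (illustrative only, not the answer author's choice):

```csharp
static class Fnv1aHash
{
    // 32-bit FNV-1a over the UTF-16 code units of a string.
    // Standard constants: offset basis 2166136261, prime 16777619.
    public static uint Hash(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (char c in text)
        {
            unchecked
            {
                hash = (hash ^ (byte)c) * prime;        // low byte
                hash = (hash ^ (byte)(c >> 8)) * prime; // high byte
            }
        }
        return hash;
    }
}
```

A composite key built from several string properties could then be hashed as, for example, Fnv1aHash.Hash(firstName + "\0" + lastName).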