是否有可能映射字符串为int比使用HashMap的快?有可能、字符串、HashMap、int

2023-09-10 23:58:56 作者:诗词歌赋.

我知道我不应该优化我的程序的每一个点,所以请考虑这个问题是学院派

我有最多100个字符串,并为他们每个人的整数倍,这样的事情:

  MSFT 1
DELL 2
HP 4
....
美国广播公司58
 

这集是preinitialized这​​意味着一旦创建它永远不会改变。集初始化后,我用它pretty的密集,因此很高兴有快速查找。字符串是pretty的短,最多30个字符。映射 INT 也被限制到100之间1。

目前至少知道字符串preinitialized和永远不会改变它应该是可能的查找哈希函数,其导致一个篮子酮项映射,但可能也有其他黑客。

一种优化我能想象 - 我只能读取第一个符号。例如,如果DELL为开头的D的唯一字符串,我收到了类似的D ***不是我不需要连看书都串!这是obvioslyDELL。这样的查找必须比HashMap中查找显著更快。 (当然这里我认为,我们收到的哈希值,但它并非总是如此唯一的符号)

有没有准备好使用或易于实现我的问题的解决方案?我使用C ++和提升。

UPD 我检查,发现我的股票兑换限额为12个符号,而不是30如上所述。但是其他交易所可能会允许略低较长的符号,因此也同样吸引有算法,将继续努力高达20 charachters长的行情。

解决方案 hashmap中的key是有序的么 深入浅出学Java HashMap

一个哈希表 [1] 在原理上是最快的方式。

您可以但是编译完善哈希函数的因为你知道的全部事实域的时间提前。

有了一个完美的哈希值,但不一定是冲突,所以你可以存储在哈希表中的线性阵列!的

通过适当的调整,你可以再

适用于所有散元素在有限的空间,从而直接寻址的潜在选项 有一个反向查找在O(1)

在'老派'的工具来产生完美的哈希函数将的gperf(1)。维基百科列出了关于这个问题的更多的资源。

  

由于所有的辩论中的我跑了演示

     

下载NASDAQ股票代码并获得从该组100个随机样本,运用的gperf如下:

 的gperf -e'\ 015'-L C ++ -7 -C -E -k'*,1,$'-m 100选择> perfha​​sh.cpp
 

的哈希值MAX_HASH_VALUE结果157 和直接字符串查找尽可能多的项目表。下面是的只是的散列函数用于演示目的:

 在线unsigned int类型Perfect_Hash ::散(注册为const char *海峡,注册unsigned int类型的len){
  静态常量无符号字符asso_values​​ [] = {
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,64,40,1,62,1,
       41,18,47,0,1,11,10,57,21,7,
       14,13,24,3,33,89,11,0,19,5,
       12,0,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156,156,
      156,156,156,156,156,156,156,156,156
    };
  注册INT hval = LEN;

  开关(hval){
      默认:hval + = asso_values​​ [(无符号字符)海峡[4]; / * FALLTHROUGH * /
      案例4:hval + = asso_values​​ [(无符号字符)STR [3]; / * FALLTHROUGH * /
      案例3:hval + = asso_values​​ [(无符号字符)海峡[2] +1]; / * FALLTHROUGH * /
      案例2:hval + = asso_values​​ [(无符号字符)海峡[1]; / * FALLTHROUGH * /
      案例1:hval + = asso_values​​ [(无符号字符)海峡[0];打破;
  }
  返回hval;
}
 

     

这真的没有得到有效得多。请看看在 全在GitHub上来源:https://gist.github.com/sehe/5433535

     

你要知道,这是一个完美的哈希值,也因此会有没有冲突

  

问: [...] 这是obvioslyDELL。这样的查找必须比HashMap中查找显著快。的

答:如果您使用一个简单的的std ::地图的净效应是preFIX搜索(因为辞书字符串比较的快捷键第一个字符不匹配)。同样的事情会发生在一个排序容器二进制搜索。

[1] PS 。对于100个字符串,字符串的排序数组的std ::搜索的std :: LOWER_BOUND 将可能以最快的速度/更快由于参考 改进 地点。请咨询您的个人资料结果,看看这是否适用。

I understand that I should not optimize every single spot of my program so please consider this question to be "academic"

I have maximum 100 strings and integer number for each of them, something like that:

MSFT 1
DELL 2
HP   4
....
ABC  58

This set is preinitialized that means that once created it never changes. After set is initialized I use it pretty intensive so it nice to have fast lookup. Strings are pretty short, maximum 30 characters. Mapped int is also limited and between 1 and 100.

At least knowing that strings are preinitialized and never change it should be possible to "find" hash-function that results in "one basket-one item" mapping, but probably there are other hacks.

One optimization I can imagine - i can read first symbol only. For example if "DELL" is the only string starting with "D" and I have received something like "D***" than I do not need even to read the string! it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup". (well here I assumed that we receive only symbols that in hash, but it is not always the case)

Are there any ready to use or easy to implement solutions for my problem? i'm using c++ and boost.

upd I've check and found that for my exchange limit for ticker is 12 symbols, not 30 as mentioned above. However other exchanges may allow slighty longer symbols so it's interesting to have algorithm that will continue working on up to 20-charachters long tickers.

解决方案

A hashtable[1] is in principle the fastest way.

You could however compile a Perfect Hash Function given the fact that you know the full domain ahead of time.

With a perfect hash, there need not be a collision, so you can store the hash table in a linear array!

With proper tweaking you can then

fit all of the hash elements in a limited space, making direct addressing a potential option have a reverse lookup in O(1)

The 'old-school' tool for generating Perfect Hash functions would be gperf(1). The wikipedia lists more resources on the subject.

Because of all the debate I ran a demo:

Downloading NASDAQ ticker symbols and getting 100 random samples from that set, applying gperf as follows:

gperf -e ' \015' -L C++ -7 -C -E -k '*,1,$' -m 100 selection > perfhash.cpp

Results in a hash-value MAX_HASH_VALUE of 157 and a direct string lookup table of as many items. Here's just the hash function for demonstration purposes:

inline unsigned int Perfect_Hash::hash (register const char *str, register unsigned int len) {
  static const unsigned char asso_values[] = {
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156,  64,  40,   1,  62,   1,
       41,  18,  47,   0,   1,  11,  10,  57,  21,   7,
       14,  13,  24,   3,  33,  89,  11,   0,  19,   5,
       12,   0, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156
    };
  register int hval = len;

  switch (hval) {
      default: hval += asso_values[(unsigned char)str[4]];   /*FALLTHROUGH*/
      case 4:  hval += asso_values[(unsigned char)str[3]];   /*FALLTHROUGH*/
      case 3:  hval += asso_values[(unsigned char)str[2]+1]; /*FALLTHROUGH*/
      case 2:  hval += asso_values[(unsigned char)str[1]];   /*FALLTHROUGH*/
      case 1:  hval += asso_values[(unsigned char)str[0]];   break;
  }
  return hval;
}

It really doesn't get much more efficient. Do have a look at the full source at github: https://gist.github.com/sehe/5433535

Mind you, this is a perfect hash, too, so there will be no collisions

Q. [...] it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup".

A: If you use a simple std::map the net effect is prefix search (because lexicographical string comparison shortcuts on the first character mismatch). The same thing goes for binary search in a sorted container.

[1] PS. For 100 strings, a sorted array of string with std::search or std::lower_bound would potentially be as fast/faster due to the improved Locality of Reference. Consult your profile results to see whether this applies.