Are there any working implementations of the rolling hash function used in the Rabin-Karp string search algorithm?

2023-09-11 03:29:51 Author: 凉城以北深海未眠

I'm looking to use a rolling hash function so I can take hashes of n-grams of a very large string.

For example, "stackoverflow" broken up into 5-grams would be:

"stack", "tacko", "ackov", "ckove", "kover", "overf", "verfl", "erflo", "rflow"
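
To make the windowing concrete, here is a minimal sketch that enumerates the n-grams of a string. The class name `NGrams` and the method `of` are hypothetical names chosen for this illustration; they are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

class NGrams {
    // Returns every contiguous substring of length n, from left to right.
    // (Hypothetical helper, introduced only for this illustration.)
    static List<String> of(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the nine 5-grams listed above.
        System.out.println(of("stackoverflow", 5));
    }
}
```

Materializing the substrings like this is only for illustration; with a rolling hash the n-grams never need to be copied out of the large string.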

This is ideal for a rolling hash function because after I calculate the first n-gram hash, the following ones are relatively cheap to calculate because I simply have to drop the first letter of the first hash and add the new last letter of the second hash.

I know that in general this hash function is generated as:

$H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + \dots + c_k a^{0}$, where $a$ is a constant and $c_1, \dots, c_k$ are the input characters.
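
In this notation, the "drop the first letter, add the new last letter" step can be written out explicitly. As a sketch, write $H_1$ for the hash of $c_1 \dots c_k$ and $H_2$ for the hash of the next window $c_2 \dots c_{k+1}$ (the subscripts on $H$ are notation introduced here for illustration):

```latex
\begin{aligned}
H_1 &= c_1 a^{k-1} + c_2 a^{k-2} + \dots + c_k a^{0} \\
H_2 &= c_2 a^{k-1} + c_3 a^{k-2} + \dots + c_{k+1} a^{0} \\
    &= \left(H_1 - c_1 a^{k-1}\right) a + c_{k+1}
\end{aligned}
```

So each new hash costs one subtraction, one multiplication, and one addition, provided $a^{k-1}$ is computed once up front.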

If you follow this link on the Rabin-Karp string search algorithm, it states that "a" is usually some large prime.

I want my hashes to be stored in 32 bit integers, so how large of a prime should "a" be, such that I don't overflow my integer?

Does there exist an existing implementation of this hash function somewhere that I could already use?

Here is an implementation I created:

public class hash2
{

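    public int prime = 101;

    // Integer power with 32-bit wraparound. Math.pow returns a double, and
    // casting an out-of-range double to int clamps to Integer.MAX_VALUE instead
    // of wrapping, so hash() and rollHash() disagree once prime^(n-1) exceeds
    // the int range (n-grams of 6 or more characters with prime = 101).
    private int pow(int base, int exponent)
    {
        int result = 1;
        for(int i = 0; i < exponent; i++)
        {
            result *= base;
        }
        return result;
    }

    public int hash(String text)
    {
        int hash = 0;

        for(int i = 0; i < text.length(); i++)
        {
            char c = text.charAt(i);
            hash += c * pow(prime, text.length() - 1 - i);
        }

        return hash;
    }

    public int rollHash(int previousHash, String previousText, String currentText)
    {

        char firstChar = previousText.charAt(0);
        char lastChar = currentText.charAt(currentText.length() - 1);

        // Remove the contribution of the outgoing first character, shift the
        // remaining characters up by one power of prime, then append the new one.
        int firstCharHash = firstChar * pow(prime, previousText.length() - 1);
        int hash = (previousHash - firstCharHash) * prime + lastChar;

        return hash;
    }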

    public static void main(String[] args)
    {
        hash2 hashify = new hash2();

        int firstHash = hashify.hash("mydog");
        System.out.println(firstHash);
        System.out.println(hashify.hash("ydogr"));
        System.out.println(hashify.rollHash(firstHash, "mydog", "ydogr"));
    }

}

I'm using 101 as my prime. Does it matter if my hashes will overflow? I think this is desirable but I'm not sure.

Does this seem like the right way to go about this?

Recommended answer

I remember a slightly different implementation, which seems to be from one of Sedgewick's algorithms books (it also contains example code, so try to look it up). Here's a summary adjusted to 32-bit integers:

You use modulo arithmetic after each operation to prevent your integer from overflowing.

Initial setup:

c = text ("stackoverflow")
M = length of the "n-grams"
d = size of your alphabet (256)
q = a large prime so that (d+1)*q doesn't overflow (8355967 might be a good choice)
dM = d^(M-1) mod q
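
As a rough sketch of this setup in Java, the constants and the precomputed factor dM might look like the following. The class name `RabinKarpSetup` and the method `highOrderFactor` are hypothetical names introduced for this illustration; the values of d and q are the ones suggested above.

```java
class RabinKarpSetup {
    static final int D = 256;      // alphabet size (d)
    static final int Q = 8355967;  // modulus (q) suggested above

    // Checks the constraint from the setup: (d + 1) * q must fit in a signed 32-bit int.
    static boolean fitsInInt() {
        return (long) (D + 1) * Q <= Integer.MAX_VALUE;
    }

    // Computes dM = d^(M-1) mod q, reducing after every multiplication so the
    // intermediate value never exceeds (q - 1) * d.
    // (Hypothetical helper name, not taken from the book.)
    static int highOrderFactor(int m) {
        int dM = 1;
        for (int i = 1; i < m; i++) {
            dM = (dM * D) % Q;
        }
        return dM;
    }

    public static void main(String[] args) {
        System.out.println(fitsInInt());
        System.out.println(highOrderFactor(5));
    }
}
```

Reducing inside the loop, rather than computing d^(M-1) first and reducing once, is what keeps every intermediate value inside the 32-bit range.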

First calculate the hash value of the first n-gram:

h = 0
for i from 1 to M:
  h = (h*d + c[i]) mod q

And for every following n-gram:

for i from 1 to length(c)-M:
  // first subtract the oldest character
  h = (h + d*q - c[i]*dM) mod q

  // then add the next character
  h = (h*d + c[i+M]) mod q
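
The two loops above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the book's code: the class and method names are hypothetical, indices are 0-based, and every character is assumed to be below 256 (for example ASCII or Latin-1 text), since larger `char` values would break the overflow bound.

```java
import java.util.ArrayList;
import java.util.List;

class ModularRollingHash {
    static final int D = 256;      // alphabet size; assumes every char is < 256
    static final int Q = 8355967;  // modulus suggested above; (D + 1) * Q fits in an int

    // Hash of text[start, start + m) by Horner's rule, reduced mod Q at each step.
    static int directHash(String text, int start, int m) {
        int h = 0;
        for (int i = start; i < start + m; i++) {
            h = (h * D + text.charAt(i)) % Q;
        }
        return h;
    }

    // Hashes of all n-grams of length m, each derived from the previous one in O(1).
    static List<Integer> rollingHashes(String text, int m) {
        List<Integer> hashes = new ArrayList<>();
        if (m <= 0 || text.length() < m) {
            return hashes;
        }

        // dM = D^(m-1) mod Q, the weight of the oldest character in the window.
        int dM = 1;
        for (int i = 1; i < m; i++) {
            dM = (dM * D) % Q;
        }

        int h = directHash(text, 0, m);
        hashes.add(h);
        for (int i = 0; i + m < text.length(); i++) {
            // First subtract the oldest character; adding D * Q keeps the value non-negative.
            h = (h + D * Q - text.charAt(i) * dM) % Q;
            // Then shift by one position and add the next character.
            h = (h * D + text.charAt(i + m)) % Q;
            hashes.add(h);
        }
        return hashes;
    }

    public static void main(String[] args) {
        System.out.println(rollingHashes("stackoverflow", 5));
    }
}
```

Comparing each rolled value against `directHash` on the same window is a cheap way to check an implementation like this before running it over a very large string.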

The reason you have to add d*q before subtracting the oldest character is that the previous modulo operation can leave h smaller than c[i]*dM, so the subtraction on its own could produce a negative value.
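
A small Java sketch of the difference, using arbitrary illustrative values for h, c, and dM (the class and method names are hypothetical). In Java the `%` operator takes the sign of the dividend, so the plain subtraction can yield a negative result:

```java
class NegativeRemainderDemo {
    static final int D = 256;
    static final int Q = 8355967;

    // Subtract first and reduce afterwards: the result can be negative in Java.
    static int removeOldestNaive(int h, int c, int dM) {
        return (h - c * dM) % Q;
    }

    // Add D * Q before subtracting: since c < D and dM < Q, c * dM < D * Q,
    // so the value being reduced is never negative.
    static int removeOldestSafe(int h, int c, int dM) {
        return (h + D * Q - c * dM) % Q;
    }

    public static void main(String[] args) {
        int h = 5, c = 200, dM = 1000; // arbitrary values with h < Q, c < D, dM < Q
        System.out.println(removeOldestNaive(h, c, dM)); // prints -199995
        System.out.println(removeOldestSafe(h, c, dM));  // prints 8155972
    }
}
```

Both results are congruent modulo q, but only the second lies in the range 0 to q-1 that the rest of the computation relies on. `Math.floorMod` would be another way to get a non-negative result.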

Errors included, but I think you should get the idea. Try to find one of Sedgewick's algorithms books for details, fewer errors, and a better description. :)