如何在HyperLogLog算法的工作吗?算法、如何在、工作、HyperLogLog

2023-09-10 23:23:14 作者:踏雪无痕

我一直在学习在我的业余时间不同的算法近日,一个我碰到这似乎是很有趣的被称为HyperLogLog算法 - 这估计独特的项目有多少是在一个列表

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.

这是特别有趣的我,因为它把我带回我的MySQL天,当我看到基数值(我一直认为直到最近,它的计算未估计)。

This was particularly interesting to me because it brought me back to my MySQL days when I saw that "Cardinality" value (which I always assumed until recently that it was calculated not estimated).

所以我知道如何编写算法为O(n),将计算出独特的项目有多少是在数组中。我在Javascript写这个

So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in Javascript

function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}

但问题是,我的算法,而O(N),使用了大量的内存(存储值)。

我一直在阅读本文对如何计算重复在O(n)的时间,使用最少的内存列表。 http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

据介绍,通过散列和计数位什么的人能在一定范围内的概率估算(假设列表中均匀分布)的唯一项目的数量在列表中。

It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.

我读过的文章,但我似乎无法理解。有人可以给出一个更外行的解释?我知道什么是哈希值,但我不明白它们是如何在这个HyperLogLog算法中使用。

I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.

推荐答案

我们用 HyperLogLog 算法相当广泛在我们的基础设施。因此,我们最终要解释给每个人,从高管到开发商。为了让我们的生活变得更轻松,我们已经提出了一个blog帖子它,已包括写在 D3 来帮助说明它模拟。我希望它能帮助!

We use the HyperLogLog algorithm quite extensively in our infrastructure. As a result we end up trying to explain it to everyone from executives to developers. To make our lives a little easier, we have put up a blog post on it and have included a simulation written in D3 to help illustrate it. I hope it helps!