我一直在学习在我的业余时间不同的算法近日,一个我碰到这似乎是很有趣的被称为HyperLogLog算法 - 这估计独特的项目有多少是在一个列表
I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
这是特别有趣的我,因为它把我带回我的MySQL天,当我看到基数值(我一直认为直到最近,它的计算未估计)。
This was particularly interesting to me because it brought me back to my MySQL days when I saw that "Cardinality" value (which I always assumed until recently that it was calculated not estimated).
所以我知道如何编写算法为O(n),将计算出独特的项目有多少是在数组中。我在Javascript写这个
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in Javascript
function countUniqueAlgo1(arr) {
var Table = {};
var numUnique = 0;
var numDataPoints = arr.length;
for (var j = 0; j < numDataPoints; j++) {
var val = arr[j];
if (Table[val] != null) {
continue;
}
Table[val] = 1;
numUnique++;
}
return numUnique;
}
但问题是,我的算法,而O(N),使用了大量的内存(存储值表
)。
我一直在阅读本文对如何计算重复在O(n)的时间,使用最少的内存列表。 http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
据介绍,通过散列和计数位什么的人能在一定范围内的概率估算(假设列表中均匀分布)的唯一项目的数量在列表中。
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
我读过的文章,但我似乎无法理解。有人可以给出一个更外行的解释?我知道什么是哈希值,但我不明白它们是如何在这个HyperLogLog算法中使用。
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
我们用 HyperLogLog
算法相当广泛在我们的基础设施。因此,我们最终要解释给每个人,从高管到开发商。为了让我们的生活变得更轻松,我们已经提出了一个blog帖子它,已包括写在 D3
来帮助说明它模拟。我希望它能帮助!
We use the HyperLogLog
algorithm quite extensively in our infrastructure. As a result we end up trying to explain it to everyone from executives to developers. To make our lives a little easier, we have put up a blog post on it and have included a simulation written in D3
to help illustrate it. I hope it helps!