最快捷的方式找到前k个频繁词在一个大的字序列序列、频繁、快捷、方式

2023-09-10 22:36:55 作者:养猪仙人

输入:一个正整数K和一个大的文本。该文本实际上可以看作是单词序列。所以我们不必担心如何分解成单词序列。 输出:在文本中最常见的K字

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence. Output: The most frequent K words in the text.

我的想法是这样的。

使用哈希表来记录所有的话'的频率,而遍历整个单词序列。在这个阶段中,关键是字,该值为字频。这需要O(n)的时间。

use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

排序(字,字频)对;关键是字频。这需要为O(n * LG(N))与正常排序算法的时间。

sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

整理后,我们只取前K字。这需要O(K)的时间。

After sorting, we just take the first K words. This takes O(K) time.

总之,总时间为O(N + N的 LG(N)+ K),由于K是肯定小于N,所以它实际上是为O(n 的LG(N))

To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).

我们可以改善这一点。其实,我们只是想顶K字。换句话说'频率不关心我们。因此,我们可以用局部堆排序。对于步骤2),3),我们不只是做排序。相反,我们改变它是

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2)建立的(字,字频)对堆用字频率为重点。它需要O(n)的时间来建立一个堆;

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3')中提取从堆顶部K个单词。每个提取是O(LG(N))。因此,总的时间是O(K * LG(N))。

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

总之,这种解决方案成本的时间为O(n + K * LG(N))。

To summarize, this solution cost time O(n+k*lg(n)).

这就是我的想法。我还没有找到方法来改善步骤1)。 我希望有信息检索专家可以阐明这个问题更多的光。

This is just my thought. I haven't find out way to improve step 1). I Hope some Information Retrieval experts can shed more light on this question.

推荐答案

这可以为O完成(n)的时间

This can be done in O(n) time

解决方案1:

步骤:

计数的单词和散列它,这最终会在这样的结构 Excel序列是灰色的,每次下拉都是相同的数字,比如,输入1往下拉就是全部是1,求解怎么变为123

Count words and hash it, which will end up in the structure like this

var hash = {
  "I" : 13,
  "like" : 3,
  "meow" : 3,
  "geek" : 3,
  "burger" : 2,
  "cat" : 1,
  "foo" : 100,
  ...
  ...

通过散列和导线找到(在这种情况下,富100),然后创建该尺寸的阵列中最常用的词

Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size

那么我们可以再次遍历哈希和使用出现的话作为数组索引的数量,如果没有索引,创建数组否则将其追加到数组中。然后,我们结束了一个数组,如:

Then we can traverse the hash again and use the number of occurrences of words as array index, if there is nothing in the index, create an array else append it in the array. Then we end up with an array like:

  0   1      2            3                100
[[ ],[ ],[burger],[like, meow, geek],[]...[foo]]

然后,只需遍历从最终的数组,并收集了K字。

Then just traverse the array from the end, and collect the k words.

解决方案2:

步骤:

:同上 使用最小堆,并保持最小堆的尺寸为k,并在散列每个单词我们比较词的出现次数与分钟,1)如果它是大于最小值,取出分钟(如果尺寸最小堆等于k),并插入在最小堆的数量。 2)休息条件简单。 通过遍历数组后,我们就在最小堆转换为数组,并返回数组。
 
精彩推荐
图片推荐