Algorithm for detecting duplicates in a dataset too large to be fully loaded into memory

2023-09-11 04:40:04 Author: 黎明独徘徊


Is there an optimal solution to this problem?

Describe an algorithm for finding duplicates in a file of one million phone numbers. The algorithm, when running, would only have two megabytes of memory available to it, which means you cannot load all the phone numbers into memory at once.

My 'naive' solution would be an O(n^2) solution which iterates over the values and just loads the file in chunks instead of all at once.

For i = 0 to 999,999
  string currentVal = get the item at index i
  for j = i + 1 to 999,999
    if ((j - i) mod fileChunkSize == 0)
      load file chunk into array
    if data[j] == currentVal
      add currentVal to duplicateList and exit for
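A minimal Python sketch of this naive chunked scan (the one-number-per-line file format, the `read_chunk` helper, and the parameter names are assumptions for illustration, not part of the question):

```python
def read_chunk(path, start, size):
    """Read up to `size` lines starting at line index `start` (illustrative helper)."""
    chunk = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= start + size:
                break
            if i >= start:
                chunk.append(line.strip())
    return chunk

def find_duplicates_naive(path, n, chunk_size):
    """O(n^2) scan: for each value, re-read the rest of the file in chunks."""
    duplicates = set()
    for i in range(n):
        current = read_chunk(path, i, 1)[0]  # the item at index i
        j = i + 1
        while j < n:
            chunk = read_chunk(path, j, chunk_size)
            if current in chunk:
                duplicates.add(current)
                break  # "exit for" in the pseudocode
            j += len(chunk)
    return duplicates
```

Only one chunk plus one current value is in memory at a time, at the cost of re-reading the file many times.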

There must be another approach where you can process the whole dataset in a really clever way and verify whether a number is duplicated. Anyone have one?

Solution

Divide the file into M chunks, each of which is large enough to be sorted in memory. Sort them in memory.
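This splitting step might be sketched in Python as follows (the chunk size and temp-file layout are assumptions; real phone numbers would compare fine as fixed-width strings):

```python
import tempfile

def write_sorted(chunk):
    """Write one sorted run to a temp file and return its path."""
    out = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    out.write("\n".join(sorted(chunk)) + "\n")
    out.close()
    return out.name

def sort_into_chunks(path, chunk_size):
    """Split the input file into M sorted runs, each small enough for memory."""
    run_paths = []
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                run_paths.append(write_sorted(chunk))
                chunk = []
    if chunk:  # final partial chunk
        run_paths.append(write_sorted(chunk))
    return run_paths
```

With 2 MB of memory and 1,000,000 numbers, `chunk_size` would be chosen so one chunk fits comfortably in memory while sorting.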

For each pair of chunks, we will then carry out the last step of mergesort to make one larger chunk: (c_1 + c_2), (c_3 + c_4), ..., (c_m-1 + c_m).

Point at the first element on c_1 and c_2 on disk, and make a new file (we'll call it c_1+2).

If c_1's pointed-to element is a smaller number than c_2's pointed-to element, copy it into c_1+2 and point to the next element of c_1. Otherwise, copy c_2's pointed-to element into c_1+2 and point to the next element of c_2.

Repeat the previous step until both arrays are empty. You only need to use the space in memory needed to hold the two pointed-to numbers. During this process, if you encounter c_1 and c_2's pointed-to elements being equal, you have found a duplicate - you can copy it in twice and increment both pointers.
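The merge-with-duplicate-detection step might look like this in Python, reading two sorted run files line by line (the one-number-per-line format is an assumption); only the two currently pointed-to values are held in memory:

```python
def merge_runs(path_a, path_b, out_path, duplicates):
    """Merge two sorted files into out_path, recording values seen in both."""
    with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
        va, vb = a.readline().strip(), b.readline().strip()
        while va and vb:
            if va < vb:
                out.write(va + "\n")
                va = a.readline().strip()
            elif vb < va:
                out.write(vb + "\n")
                vb = b.readline().strip()
            else:  # equal pointed-to elements: a duplicate
                duplicates.add(va)
                out.write(va + "\n")  # copy it in twice
                out.write(vb + "\n")
                va = a.readline().strip()
                vb = b.readline().strip()
        while va:  # drain whichever file still has elements
            out.write(va + "\n")
            va = a.readline().strip()
        while vb:
            out.write(vb + "\n")
            vb = b.readline().strip()
    return out_path
```

One subtlety: two equal numbers that land in the same sorted run end up adjacent, so a full implementation would also check adjacent equal lines when writing each run, since the cross-run comparison alone does not see them.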

The resulting m/2 arrays can be recursively merged in the same manner; it will take log(m) of these merge steps to generate the correct array. Each number will be compared against each other number in a way that will find the duplicates.

Alternatively, a quick and dirty solution, as alluded to by @Evgeny Kluev, is to make a bloom filter which is as large as you can reasonably fit in memory. You can then make a list of the index of each element the bloom filter flags as possibly seen before, and loop through the file a second time in order to test these candidates for duplication.
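A rough sketch of that first pass, using a simple bit-array Bloom filter (the sizes, hash construction, and names here are illustrative assumptions; a Bloom filter never misses a true duplicate, but some flagged indices will be false positives, hence the second verification pass):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def candidate_duplicates(path, size_bits=2 * 1024 * 1024 * 8):
    """First pass: indices of lines the filter has possibly seen before."""
    bloom = BloomFilter(size_bits)  # ~2 MB of bits by default
    candidates = []
    with open(path) as f:
        for i, line in enumerate(f):
            number = line.strip()
            if bloom.might_contain(number):  # duplicate or false positive
                candidates.append(i)
            bloom.add(number)
    return candidates
```

The second pass would then collect the values at these candidate indices and re-scan the file to confirm which of them really occur more than once.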

 