Finding duplicate strings in a large file

2023-09-11 23:05:56 Author: 你可爱的爹

A file contains a large number of strings (e.g. 10 billion), and you need to find the duplicate strings. You have N systems available. How would you find the duplicates?

Recommended answer

Split the file into N pieces, one per machine. On each machine, load as much of its piece into memory as fits, sort those strings, and write the sorted chunk to mass storage on that machine; repeat until the whole piece has been processed. Then, on each machine, merge the chunks into a single sorted stream, and merge the streams from all machines into one stream that contains all of the strings in sorted order. Compare each string with the previous one. If they are the same, it is a duplicate.
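The steps above can be sketched in Python as a single-process simulation. This is only an illustration under stated assumptions, not a distributed implementation: the "machines" are simulated by slicing one in-memory list, `chunk_size` stands in for "as much as fits in memory", each string is assumed to contain no newline, and the names `sorted_chunks`, `machine_stream`, and `find_duplicates` are hypothetical. It also reports each duplicated string once rather than once per extra occurrence.

```python
import heapq
import os
import tempfile
from itertools import islice


def sorted_chunks(lines, chunk_size, tmpdir):
    """Sort `lines` in chunks of at most `chunk_size` and spill each to a file."""
    paths = []
    it = iter(lines)
    while True:
        chunk = sorted(islice(it, chunk_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=tmpdir)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(s + "\n" for s in chunk)
        paths.append(path)
    return paths


def read_chunk(path):
    """Lazily yield the strings of one sorted chunk file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def machine_stream(piece, chunk_size, tmpdir):
    """One machine's work: sorted chunks merged into a single sorted stream."""
    paths = sorted_chunks(piece, chunk_size, tmpdir)
    return heapq.merge(*(read_chunk(p) for p in paths))


def find_duplicates(strings, n_machines=3, chunk_size=2):
    """Yield each string that occurs more than once, in sorted order."""
    strings = list(strings)  # assumption: the demo input fits in memory
    # Simulated split of the file into N pieces, one per machine.
    pieces = [strings[i::n_machines] for i in range(n_machines)]
    with tempfile.TemporaryDirectory() as tmpdir:
        streams = [machine_stream(p, chunk_size, tmpdir) for p in pieces]
        previous = None
        reported = None
        # Final merge across machines, comparing each string with the previous.
        for s in heapq.merge(*streams):
            if s == previous and s != reported:
                yield s
                reported = s
            previous = s


if __name__ == "__main__":
    print(list(find_duplicates(["b", "a", "c", "a", "d", "b", "a"])))
```

Both merge steps use `heapq.merge`, which reads its inputs lazily, so only one string per stream needs to be held in memory at a time. In a real deployment the final merge would read the per-machine streams over the network instead of from local generators.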