Comparing strings against a very large set of strings

2023-09-11 23:08:16 Author: 游辰

Hi guys :) I'm really confused by one task :-/

Every day there is one file of 2,000,000 to 4,000,000 lines, each containing a unique 15-digit number, like this:

850025000010145
401115000010152
400025000010166
770025555010152
512498004158752

Since the beginning of the current year, a number of such files have accumulated. I have to compare every line of today's file with all previous files from the beginning of the year and return only those numbers that have never appeared in any of the checked files.

Which language and algorithm should I use? How to implement it?

Recommended answer

You should be able to do this without writing any code beyond a simple script (e.g. bash, Windows batch, PowerShell). There are standard tools that make quick work of this type of thing.

First, you have some number of files that contain from 2 million to 4 million numbers. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:

cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort

(The -u switch removes duplicates.)

The problem with that approach is that you end up sorting a very large file. Figure 4 million lines of 15 characters plus a newline each, and almost 100 days of files, and you're working with roughly 7 gigabytes. A whole year's worth of data would be about 25 gigabytes. Sorting that takes a long time.
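The estimate above can be checked with a little shell arithmetic. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes every file is at the 4-million-line upper bound with 16 bytes per line (15 digits plus a newline).

```shell
# Rough size estimate; all figures are the approximate ones from the text.
lines_per_file=4000000      # upper bound on lines per daily file
bytes_per_line=16           # 15 digits plus one newline
bytes_per_file=$(( lines_per_file * bytes_per_line ))
hundred_days=$(( bytes_per_file * 100 ))
full_year=$(( bytes_per_file * 365 ))
echo "per file: $bytes_per_file bytes"
echo "100 days: $hundred_days bytes"
echo "365 days: $full_year bytes"
```

That works out to about 6.4 GB for 100 days and about 23.4 GB for a full year, which the text rounds up to 7 and 25.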

So instead, sort each individual file, then merge them:

sort -u file1 >file1_sort
sort -u file2 >file2_sort
...
sort -m -u file1_sort file2_sort file3_sort > combined_sort

The -m switch merges the already-sorted files.
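As a small illustration of the sort-then-merge step, here is a sketch that runs it on three tiny sample files in a temporary directory. The sample contents reuse the numbers from the question. Setting LC_ALL=C is an assumption on my part, chosen so that every sort call uses the same byte-wise ordering.

```shell
export LC_ALL=C                      # same collation for every sort call
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three small stand-ins for the daily files (note the repeated values).
printf '%s\n' 850025000010145 401115000010152 401115000010152 > file1
printf '%s\n' 770025555010152 850025000010145 > file2
printf '%s\n' 512498004158752 400025000010166 > file3

# Sort each file on its own, dropping duplicates within the file.
for f in file1 file2 file3; do
    sort -u "$f" > "${f}_sort"
done

# Merge the already-sorted files; -u drops duplicates across files.
sort -m -u file1_sort file2_sort file3_sort > combined_sort
cat combined_sort
```

With real data the loop would run over the actual daily file names instead of file1 to file3.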

Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:

sort -u today >today_sort

Now you can compare the files and output only the lines unique to today's file:

comm -2 -3 today_sort combined_sort

-2 says suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.
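To see the comparison on a small scale, the sketch below builds two tiny sorted files and runs the same comm command on them. The file contents are sample values, and the LC_ALL=C setting is an assumption: comm expects both inputs to be sorted in the collation order it is running under, so it is safest to use the same setting that was used for sort.

```shell
export LC_ALL=C
workdir=$(mktemp -d)

# Already-sorted history and already-sorted file for today.
printf '%s\n' 400025000010166 401115000010152 850025000010145 > "$workdir/combined_sort"
printf '%s\n' 401115000010152 512498004158752 770025555010152 > "$workdir/today_sort"

# Keep only column 1: lines that appear in today_sort but not in combined_sort.
new_values=$(comm -2 -3 "$workdir/today_sort" "$workdir/combined_sort")
printf '%s\n' "$new_values"
```

Here 401115000010152 is dropped because it is already in the history, and the other two numbers are reported as new. The two switches can also be written together as -23.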

Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use that combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:

comm -2 -3 today_sort combined_sort > new_values

Then:

sort -m combined_sort new_values > combined_sort_new

You'd probably want to name the file with the date, so you'd have combined_sort_20140401 and combined_sort_20140402, etc.
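One possible way to build the dated name is to take the stamp from the date command. This is a minimal sketch with sample data. The stamp variable is a name introduced here for illustration, and the input file names follow the ones used above.

```shell
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

stamp=$(date +%Y%m%d)                 # today's date as YYYYMMDD

# Sample history and sample new values, both already sorted.
printf '%s\n' 400025000010166 850025000010145 > combined_sort
printf '%s\n' 512498004158752 > new_values

# Merge into a file named after today's date.
sort -m combined_sort new_values > "combined_sort_$stamp"
cat "combined_sort_$stamp"
```

Keeping one combined file per day costs disk space, so older ones would probably be deleted once a newer one has been verified.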

So if you started at the beginning of the year and wanted to do this every day, your script would look something like:

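# Arguments: today's file, yesterday's combined file, the new combined file to write.
todays_file=$1; old_combined_sort=$2; new_combined_sort=$3
sort -u "$todays_file" > todays_sorted_file
comm -2 -3 todays_sorted_file "$old_combined_sort" > todays_uniques
# Merge only the new values so the combined file stays free of duplicates.
sort -m "$old_combined_sort" todays_uniques > "$new_combined_sort"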

$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script was called "daily":

daily todays_file.txt all_values_20140101 all_values_20140102
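To try the whole cycle end to end, the sketch below wraps the three commands in a shell function and runs it for two consecutive days on sample data. Wrapping it in a function, the day1.txt and day2.txt inputs, and the empty starting history all_values_20140100 are assumptions made for this illustration and are not part of the original answer.

```shell
export LC_ALL=C

# daily TODAYS_FILE OLD_COMBINED NEW_COMBINED
# Prints the identifiers in TODAYS_FILE that are absent from OLD_COMBINED
# and writes the updated history to NEW_COMBINED.
daily() {
    todays_file=$1; old_combined_sort=$2; new_combined_sort=$3
    tmp=$(mktemp -d) || return 1
    sort -u "$todays_file" > "$tmp/todays_sorted_file"
    comm -2 -3 "$tmp/todays_sorted_file" "$old_combined_sort" > "$tmp/todays_uniques"
    sort -m "$old_combined_sort" "$tmp/todays_uniques" > "$new_combined_sort"
    cat "$tmp/todays_uniques"
    rm -r "$tmp"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
: > all_values_20140100     # empty history before the first day (hypothetical name)
printf '%s\n' 850025000010145 401115000010152 > day1.txt
printf '%s\n' 401115000010152 400025000010166 770025555010152 > day2.txt

day1_new=$(daily day1.txt all_values_20140100 all_values_20140101)
day2_new=$(daily day2.txt all_values_20140101 all_values_20140102)
printf 'new on day 2:\n%s\n' "$day2_new"
```

On day 2 only 400025000010166 and 770025555010152 are reported, because 401115000010152 was already recorded on day 1.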
 