Is there a simple way to chunk a text file into brace-balanced sections?

2023-09-11 02:41:09 Author: 只有我的吻才能让你安静

I'm trying to parse some data out of a file using Perl and Parse::RecDescent. I can't throw the full data file at the Perl script because RecDescent would take days poring over it. So I split the huge data file into RD-sized chunks to reduce the runtime.

However, I need to extract sections within balanced brackets and the routine I have now is not robust (it depends too much on the position of the final close-bracket from a newline). Example:

cell ( identifier ) {
  keyword2 { };
  ...
  keyword3 { keyword4 {  } };
}

...more sections...

I need to grab everything from cell ... { to the matching closing } which can have various amounts of spacing and sub-sections.

There must be some Linux command-line tool that does this easily? Any ideas?

Edit: Input files are around 8M, grammar ~60 rules.

Solution

Why does RecDescent take so long? Is it because your grammar is complex? If that's the case, you could do a two-level pass using Parse::RecDescent. The idea is that you would define a simple grammar that parses only cell ... { ... }, and then pass each piece of output from that first parser into a second call to Parse::RecDescent with your more complex grammar. I'm only guessing at the reason RecDescent is slow on your data.
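
One way to sketch the first pass without writing a grammar at all is Text::Balanced, a core Perl module whose extract_bracketed function pulls a balanced bracketed span off the front of a string. Note that this swaps in a different tool from the simple Parse::RecDescent grammar suggested above. The helper name split_cells and the sample input are made up for illustration, and the sketch assumes braces never appear inside strings or comments in your data.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: return each "cell ... { ... }" section found in $text.
# Assumes braces inside a cell are balanced and never occur in strings or comments.
sub split_cells {
    my ($text) = @_;
    my @cells;
    # Capture the header from "cell" up to (not including) its opening brace.
    while ( $text =~ /(\bcell\b[^{}]*)\{/ ) {
        my $header = $1;
        # Drop everything before the opening brace ($+[1] is where the header ends).
        $text = substr( $text, $+[1] );
        my ( $body, $rest ) = extract_bracketed( $text, '{}' );
        last unless defined $body && length $body;    # unbalanced input: give up
        push @cells, $header . $body;
        $text = $rest;
    }
    return @cells;
}

# Made-up sample input in the shape shown in the question.
my $sample = "cell ( a ) {\n  keyword2 { };\n  keyword3 { keyword4 {  } };\n}\n"
           . "cell ( b ) { keyword2 { }; }\n";
print "$_\n", '-' x 40, "\n" for split_cells($sample);
```

Each returned chunk could then be handed to the parser built from the full grammar, one cell at a time. The repeated substr copies are fine for a sketch but may be worth replacing if the file holds a very large number of cells.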

Another option is to write your own simple parser that matches on the cell entries, counts the number of braces it's seen so far, and then finds the matching brace when the closing brace count is equal to the opening brace count. That should be fast, but the suggestion above might be faster to implement and easier to maintain.

Edit: You should definitely try Parse::RecDescent with a simplified grammar. The algorithmic complexity of recursive descent parsing is proportional to the number of possible parse trees, which should be something like B^N, where B is the number of branching points in your grammar and N is the number of nodes.
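
Taking that rough B^N estimate at face value (it is a guess, not a measured figure), it also suggests why chunking helps. If the N nodes are split into k chunks of about N/k nodes each, the estimated work becomes:

```latex
\[
k \cdot B^{N/k} \quad\text{instead of}\quad B^{N},
\qquad\text{e.g. } B = 2,\ N = 40,\ k = 4:\quad
4 \cdot 2^{10} = 4096 \ \text{ versus } \ 2^{40} \approx 1.1 \times 10^{12}.
\]
```

The values B = 2, N = 40 and k = 4 are purely illustrative and are not taken from the question's grammar or data.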

If you'd like to try rolling your own simple parser for a first pass over your input, the following code can get you started.

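#!/usr/bin/perl

use strict;
use warnings;

my $input_file = "input";
open my $fh, '<', $input_file or die "Cannot open $input_file: $!";

my $in_block           = 0;     # true while inside a cell block
my $seen_open          = 0;     # true once the block's first '{' has been read
my $current_block      = '';
my $open_bracket_count = 0;

while ( my $line = <$fh> ) {
    # A line mentioning "cell" starts a new block (ignored while already in one).
    if ( !$in_block && $line =~ /\bcell\b/ ) {
        $in_block = 1;
    }

    if ( $in_block ) {
        # Count every brace on the line to track the nesting depth.
        while ( $line =~ /([{}])/g ) {
            if ( $1 eq '{' ) {
                $open_bracket_count++;
                $seen_open = 1;
            } else {
                $open_bracket_count--;
            }
        }

        $current_block .= $line;
    }

    # Depth is back to zero after at least one '{': the block is complete.
    # Checking $seen_open avoids ending early when '{' is on a later line than "cell".
    if ( $in_block && $seen_open && $open_bracket_count == 0 ) {
        print '-' x 80, "\n";
        print $current_block, "\n";
        ( $in_block, $seen_open, $current_block ) = ( 0, 0, '' );
    }
}
close $fh or die "Cannot close $input_file: $!";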

Edit: changed the code to avoid slurping the entire file into memory. While slurping is trivial for an 8MB file, it's cleaner to just read the file line by line.
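
The line-based loop above appends whole lines, so anything that follows the final closing brace on the same line ends up in the block. Since the question asks for something less dependent on where that brace sits, here is a minimal sketch of a variant that still reads line by line but walks each line brace by brace and ends the block exactly at the matching '}'. The name read_cells and the in-memory sample are assumptions for illustration, and it assumes braces never appear inside strings or comments.

```perl
use strict;
use warnings;

# Hypothetical helper: read cell blocks from an open filehandle, ending each
# block at its matching '}' even when more text follows on the same line.
sub read_cells {
    my ($fh) = @_;
    my @cells;
    my $block;        # undef while outside a cell block
    my $depth = 0;    # current brace nesting depth
    while ( defined( my $line = <$fh> ) ) {
        pos($line) = 0;
        while ( pos($line) < length $line ) {
            if ( !defined $block ) {
                # Skip ahead to the next "cell" keyword on this line, if any.
                last unless $line =~ /\G.*?(?=\bcell\b)/gc;
                $block = '';
            }
            if ( $line =~ /\G([^{}]*)([{}])/gc ) {
                # Consume text up to and including the next brace.
                $block .= $1 . $2;
                $depth += $2 eq '{' ? 1 : -1;
                if ( $depth == 0 ) {
                    push @cells, $block;    # matching brace reached
                    undef $block;
                }
            }
            else {
                # No more braces on this line: keep the rest and read on.
                $block .= substr( $line, pos($line) );
                last;
            }
        }
    }
    return @cells;
}

# Made-up sample: the second cell starts right after the first one closes.
my $sample = "cell ( a ) {\n  keyword2 { };\n} cell ( b ) { keyword3 { } } trailing text\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n", '-' x 40, "\n" for read_cells($in);
close $in;
```

Passing a filehandle rather than a filename keeps the helper usable on a real file, on STDIN, or on an in-memory string as in the sample.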