Fully automatic regular expression generator (generator, fully automatic, regular expression)

2023-09-11 00:05:32 Author: 菰獨者の_smile.

I have N strings. There are also K regular expressions, unknown to me. Each string either matches one of the regular expressions or is garbage. There are a total of L garbage strings in the set. Both K and L are unknown.

I'd like to deduce the regular expressions. Obviously, this problem has an infinite number of solutions. I need to find a "reasonably good solution", which

1) minimizes K

2) minimizes L

3) maximizes the "specifics" of the regular expressions. I don't know what the right term for this quality is. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".

All 3 requirements need to be taken as one compound criterion, with certain reasonable weights.
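One way to make the compound criterion concrete is sketched below in Python. Everything in it is an assumption for illustration: the toy alphabet, the length bound, the names `generality` and `cost`, and the default weights. "Specific" is approximated here by counting how many short strings a regex accepts, so a lower count means a more specific regex.

```python
import itertools
import math
import re

ALPHABET = "ab12"  # toy alphabet (assumption, chosen to keep enumeration small)
MAX_LEN = 4        # enumerate every string of length 1..MAX_LEN


def generality(pattern: str) -> int:
    """Count the strings over ALPHABET (length 1..MAX_LEN) that fully match."""
    rx = re.compile(pattern)
    return sum(
        1
        for n in range(1, MAX_LEN + 1)
        for chars in itertools.product(ALPHABET, repeat=n)
        if rx.fullmatch("".join(chars))
    )


def cost(patterns, strings, w_k=1.0, w_l=1.0, w_g=0.1):
    """Weighted sum of K, L and total looseness; lower is better.

    The weights are hypothetical defaults, not values from the question.
    """
    compiled = [re.compile(p) for p in patterns]
    # L: strings that no candidate regex accepts count as garbage.
    garbage = sum(
        1 for s in strings if not any(rx.fullmatch(s) for rx in compiled)
    )
    # Penalise regexes that accept many strings (i.e. are less specific).
    looseness = sum(math.log2(1 + generality(p)) for p in patterns)
    return w_k * len(patterns) + w_l * garbage + w_g * looseness
```

With this measure, `generality(r"ab\d+")` is much smaller than `generality(r"\w+.+")`, so for the same strings the first regex gets the lower cost. The enumeration is exponential in `MAX_LEN`, so this only works as a toy scoring function.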

A solution for one particular case: if L = 0 and K = 1 (just one regex, and no garbage), then we can just find the LCS (longest common subsequence) of the strings and derive a corresponding regex from it. However, when we have "noise" (L > 0), this approach doesn't work.
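The special case above might look roughly like this in Python. It is a minimal sketch under stated assumptions: the function names are made up, the LCS is folded pairwise over the strings (which gives a common subsequence of all of them, though not necessarily the longest one), and the gaps are filled with `.*`, which is far less specific than what one would ultimately want.

```python
import re
from functools import reduce


def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two strings (dynamic programming)."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def regex_from_strings(strings):
    """Fold pairwise LCS over all strings, then allow gaps between kept chars."""
    common = reduce(lcs, strings)
    return ".*" + ".*".join(re.escape(c) for c in common) + ".*"
```

For example, `regex_from_strings(["ab123", "ab45", "xab9"])` keeps the shared characters `a` and `b`. Adding a single unrelated string such as `"zzz"` empties the common subsequence, which is the noise problem described above.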

Any ideas (or pointers to existing work) are greatly appreciated.

Solution

What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.

I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all N strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).

Two interesting queries at Google Scholar:

"minimum description length" automata
"language inference" automata