Automatic regular expression generator

2023-09-11 00:05:32 Author: 菰獨者の_smile.

I have N strings. There are also K regular expressions, unknown to me. Each string either matches one of the regular expressions or is garbage. There are a total of L garbage strings in the set. Both K and L are unknown.

I'd like to deduce the regular expressions. Obviously, this problem has an infinite number of solutions. I need to find a "reasonably good solution", one that

1) minimizes K

2) minimizes L

3) maximizes the "specificity" of the regular expressions. I don't know what the right term for this quality is. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".

All 3 requirements need to be combined into one compound criterion, with certain reasonable weights.
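
To make the notion of "specific" concrete, here is a minimal sketch that compares the two example regexes on a handful of probe strings. The probe strings and the helper name `accepted` are made up for illustration; the idea is only that a more specific regex accepts a smaller set of strings.

```python
import re

# The two candidate regexes from the example above.
specific = re.compile(r"ab\d+")
general = re.compile(r"\w+.+")

# Hypothetical probe strings, chosen only for illustration.
probes = ["ab123", "ab7", "xyz!!", "hello world", "ab", "123"]


def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if pattern.fullmatch(s)]


specific_hits = accepted(specific, probes)
general_hits = accepted(general, probes)

print("specific accepts:", specific_hits)
print("general accepts: ", general_hits)
```

Both regexes accept "ab123", but the general one also accepts every other probe, which is one way to read "less specific".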

A solution for one particular case: if L = 0 and K = 1 (just one regex and no garbage), then we can just find the LCS (longest common subsequence) of the strings and derive a corresponding regex from it. However, when we have "noise" (L > 0), this approach doesn't work.
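
A rough sketch of that special case, under a few assumptions: the exact LCS of many strings is expensive to compute, so this folds a pairwise LCS over the list, which yields a common subsequence that is not guaranteed to be the longest one. The function names `lcs` and `regex_from_strings` are hypothetical, and joining the common characters with `.*` is just one possible way to turn the subsequence into a regex.

```python
import re
from functools import reduce


def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two strings via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover one LCS.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def regex_from_strings(strings):
    """Build a regex whose literals are a common subsequence of all strings."""
    common = reduce(lcs, strings)  # pairwise fold: a heuristic, not the exact multi-string LCS
    if not common:
        return ".*"
    return ".*" + ".*".join(re.escape(ch) for ch in common) + ".*"


strings = ["ab123", "ab45", "ab6"]
pattern = regex_from_strings(strings)
print(pattern)
```

The resulting regex accepts every input string, but the `.*` gaps make it very general; tightening the gaps (for example to `\d+` where only digits were skipped) would be a separate step. A single garbage string can shrink the common subsequence to almost nothing, which is why the approach breaks down when L > 0.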

Any ideas (or pointers to existing work) are greatly appreciated.

Solution

What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.

I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all N strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).
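
As a toy illustration of the MDL idea (not a method taken from any particular paper), a two-part code scores a candidate pattern by the bits needed to write the pattern down plus the bits needed to encode the strings given the pattern. The sketch below assumes a deliberately tiny pattern class: fixed-length sequences of tokens, each either a literal character or one of three character classes. The class sizes over printable ASCII and all function names are assumptions made for this example.

```python
import math

# Assumed character-class sizes over printable ASCII (an assumption of this sketch).
CLASS_SIZES = {r"\d": 10, r"\w": 63, ".": 95}
ALPHABET_SIZE = 95
TOKEN_KINDS = len(CLASS_SIZES) + 1  # three classes plus "literal"


def model_bits(tokens):
    """Bits to describe the pattern itself."""
    bits = 0.0
    for tok in tokens:
        bits += math.log2(TOKEN_KINDS)        # which kind of token
        if tok not in CLASS_SIZES:
            bits += math.log2(ALPHABET_SIZE)  # which literal character
    return bits


def data_bits(tokens, strings):
    """Bits to encode the strings given the pattern.

    Assumes every string has the pattern's length and matches it.
    A literal position costs nothing; a class position costs log2(class size).
    """
    per_string = sum(math.log2(CLASS_SIZES[t]) for t in tokens if t in CLASS_SIZES)
    return per_string * len(strings)


def total_bits(tokens, strings):
    return model_bits(tokens) + data_bits(tokens, strings)


strings = ["ab123", "ab456", "ab789"]
specific = ["a", "b", r"\d", r"\d", r"\d"]
general = [r"\w", r"\w", ".", ".", "."]

print("specific:", round(total_bits(specific, strings), 1))
print("general: ", round(total_bits(general, strings), 1))
```

In this toy setting the specific pattern is more expensive to write down but far cheaper for encoding the data, so its total is lower. That is the trade-off between a small grammar and a specific one that the question asks about.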

Two interesting queries at Google Scholar:

"minimum description length" automata
"language inference" automata
