对于一个正则表达式,如何确定配衬模式最长的字符串的长度?字符串、长度、最长、模式

2023-09-11 06:25:50 作者:别闹回家吃药

有一个正则表达式 regexPattern ,我怎么能确定的最长的字符串相匹配的 regexPattern 的长度。

Having a Regex pattern regexPattern, how can I determine the length of the longest string that matches the regexPattern.

INT LongestLength(字符串模式)应该是这样的:

Assert.Equals(LongestLength("[abc]"), 1);
Assert.Equals(LongestLength("(a|b)c?"), 2);

Assert.Equals(LongestLength("d*"), int.MaxValue); // Or throws NoLongestLengthException

虽然这个问题是在C#,C#和JavaScript的回答都不错。

Although the question is in C#, both C# and JavaScript answers are good.

推荐答案

这是pretty的直白了的正确的正则表达式只使用操作符 * + | ,加上括号字符类,当然还有普通字符。事实上,即使 \ 1 式的反向引用(这是不是一个正则表达式的形式化定义的一部分,并做复杂有关正则表达式的一些问题),可以毫无问题地处理。

It's pretty straightforward for a proper regex using just the operators ?, * and + and |, plus parentheses, character classes and of course ordinary characters. In fact even \1-style backreferences (which aren't part of the formal definition of a regex, and do complicate some questions about regexes) can be handled without problems.

一个正则表达式是一个树结构只是一种压缩编码(类似于如何数学公式​​是树结构描述的算术紧凑编码)。每对相邻字符之间有一个隐含的跟随对应于一个节点与2个孩子,一个是subregex只是其左侧,而另一个是正则表达式的整个其余操作者;通过分离subregexes序列| 字符对应一个ALT节点尽可能多的儿童有替代品(即不是 | 字符),而所有其他经营者有只是一个单一的儿童(即它运行在subregex),和每一个普通的字符或字符类没有孩子的。计算最大长度匹配的字符串,可以只执行此树结构的下至上遍历,在每个节点处贪婪地分配,将匹配节点最长的字符串的长度,最长弦的给定的知识,将符合其孩子。

A regex is just a compact encoding of a tree structure (similar to how mathematical formulas are compact encodings of tree structures describing arithmetic). Between every adjacent pair of characters there is an implicit "follows" operator that corresponds to a node with 2 children, one being the subregex just to its left, and the other being the entire rest of the regex; a sequence of subregexes separated by | characters corresponds to a single "alt" node with as many children as there are alternatives (i.e., one more than the number of | characters), while every other operator has just a single child (namely the subregex it operates on), and every ordinary character or character class has no children at all. To calculate the maximum-length matching string, you can just do a bottom-up traversal of this tree structure, at each node greedily assigning the length of the longest string that would match that node, given knowledge of the longest strings that would match its children.

有决定,在这种树匹配任何给定节点的最长字符串的长度的规则是:

The rules for deciding the length of the longest string that matches any given node in this tree are:

在如下(X,Y)( XY ):MAXLEN(X)+ MAXLEN(Y) ALT(A,B,...,Z)( A | B | ... | Z ):最大(MAXLEN(一),MAXLEN(B ),...,MAXLEN(Z)) 也许(X)( X ?):MAXLEN(X) 代表(X)( X * )或posrep(X)( X + ):无限 在其他任何一个字符或类( [...] ):1 \ 1 风格的反向引用:在MAXLEN为相应parenthesised EX pression follows(x, y) (xy): maxlen(x) + maxlen(y) alt(a, b, ..., z) (a|b|...|z): max(maxlen(a), maxlen(b), ..., maxlen(z)) maybe(x) (x?): maxlen(x) rep(x) (x*) or posrep(x) (x+): infinity Any other single character or character class ([...]): 1 \1-style backreferences: the maxlen for the corresponding parenthesised expression

一个后果是,在presence * + 的任何地方(除非逃脱或一部分字符类,很明显)会导致无穷传播了树,直到它击中的根源。

One consequence is that the presence of * or + anywhere (unless escaped or part of a character class, obviously) will cause infinity to propagate up the tree until it hits the root.

Regex: abcd
"Function call syntax": follows(a, follows(b, follows(c, d)))
As a tree:

follows
 /  \
a  follows
    /  \
   b  follows
       /  \
      c    d

第二个例子:

A second example:

Regex: (a|b|de)c?
"Function call" syntax: follows(alt(a, b, follows(d, e)), maybe(c))
As a tree:

     follows
     /     \
   alt     maybe
  / | \        \
 a  b follows   c
       /  \
      d    e

对于第二个正则表达式/树,自底向上穿越将为叶节点A,B,D,E和C指定的1 MAXLEN;则的maxlen用于底层如下()节点是1 + 1 = 2;则的maxlen为中高音()节点为max(1,1,2)= 2;该MAXLEN的也许节点是1;所述的maxlen为最上面的如下()节点,因此对整个正则表达式,为2 + 1 = 3。

For this second regex/tree, a bottom-up traversal will assign a maxlen of 1 for the leaf nodes a, b, d, e and c; then the maxlen for the bottom follows() node is 1 + 1 = 2; then the maxlen for the alt() node is max(1, 1, 2) = 2; the maxlen for the maybe node is 1; the maxlen for the topmost follows() node, and thus for the entire regex, is 2 + 1 = 3.

如果你的意思是正则表达式,让其他的Perl风格的增强功能,它可能会变得更加复杂,因为长度的局部最优选择,可能会导致全球次优的。 (我原以为它甚至有可能是Perl风格的扩展让正则表达式图灵完备的,这意味着它会在一般无法确定是否存在的任意的匹配字符串 - 但的

 
精彩推荐
图片推荐