Matching exactly N repetitions of the same character

2023-09-03 03:25:51 Author: 尸体派对

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important restriction: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expression should match aaaa and cccc but not bbbb.

I'm not focused on any specific dialect, so feel free to use any language. Please do not post code that works for this specific example only; I'm looking for a general solution.

Recommended answer

Use negative lookahead and negative lookbehind.

This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1), except that Python's re module is broken (see this link).

English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
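The four steps in that translation can also be spelled out without any regex. The following is only an illustrative sketch: the helper name exact_run_starts_at is hypothetical and not part of the original answer.

```python
def exact_run_starts_at(text, i, n):
    """Return True if a run of exactly n equal characters starts at index i.

    Hypothetical helper that mirrors the four steps of the regex by hand.
    """
    ch = text[i]                          # 1. match any character
    if i > 0 and text[i - 1] == ch:       # 2. the character before must differ
        return False
    if text[i:i + n] != ch * n:           # 3. n-1 more repetitions must follow
        return False
    return text[i + n:i + n + 1] != ch    # 4. the character after must differ


text = "xxaaaayyybbbbbzzccccxx"
starts = [i for i in range(len(text)) if exact_run_starts_at(text, i, 4)]
print(starts)
```

For the example string and N=4, the only indices that pass all four checks are the starts of aaaa and cccc.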

Unfortunately, the re module (like most regular expression engines) is limited here: you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be of constant length, and the compilers aren't smart enough to infer that the length is constant when a backreference is used (even though, as in this case, the backreference has a constant length). (Python 3.5 and later do accept references to fixed-width groups inside a lookbehind, so this applies to older versions.) We have to hand-hold the regex compiler through this, like so:

The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"

This works around that bug in the re module by using (?=\1).. instead of \1 followed by a dot. (These are equivalent most of the time.) This lets the regex engine know the exact width of the lookbehind assertion, so it works in PCRE, in re, and so on.
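Note that N-1 in the pattern is a placeholder, so the actual number has to be substituted before compiling. A minimal sketch of how that might look in Python follows; the function names exact_run_pattern and find_exact_runs are made up for illustration, and re.DOTALL is an added assumption so that runs of newlines are treated like any other character.

```python
import re


def exact_run_pattern(n):
    """Compile the workaround regex for runs of exactly n equal characters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (.)              capture the first character of the run
    # (?<!(?=\1)..)    the character before it must not be the same one
    # \1{n-1}          n-1 further repetitions
    # (?!\1)           the character after the run must differ
    return re.compile(r"(.)(?<!(?=\1)..)\1{%d}(?!\1)" % (n - 1), re.DOTALL)


def find_exact_runs(text, n):
    """Return every run of exactly n equal characters found in text."""
    # group(0) is the whole run; findall() would return only the captured char.
    return [m.group(0) for m in exact_run_pattern(n).finditer(text)]


print(find_exact_runs("xxaaaayyybbbbbzzccccxx", 4))
```

With N=4 this picks out aaaa and cccc and skips the five-character run of b, which is the behaviour the question asks for.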

Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]
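Wrapped in a function with N as a parameter, that one-liner might look like the sketch below. The name runs_of_length is hypothetical, and the itertools.groupby variant is an alternative added here for comparison, not something the original answer proposes.

```python
import re
from itertools import groupby


def runs_of_length(text, n):
    """Split text into maximal runs of equal characters, keep those of length n."""
    # (.)\1* greedily consumes a whole run, so each match is a maximal run.
    return [m.group() for m in re.finditer(r"(.)\1*", text, re.DOTALL)
            if len(m.group()) == n]


def runs_of_length_groupby(text, n):
    """Same result without regular expressions, using itertools.groupby."""
    runs = ("".join(group) for _, group in groupby(text))
    return [run for run in runs if len(run) == n]


sample = "xxaaaayyybbbbbzzccccxx"
print(runs_of_length(sample, 4))
print(runs_of_length_groupby(sample, 4))
```

Both versions avoid lookarounds entirely: they first cut the string into maximal runs and only then filter by length, so a run longer than N can never produce a partial match.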