Extending regular expression syntax to say 'does not contain text XYZ'

2023-09-03 16:03:01 Author: 时光他是个庸医

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?

My app is written in C# .NET 3.5.

Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that can not match in the text to be tested.

So I might use some regexes like this (contrived) example:

    on (this|that|these) day(s)?¬(every|all) day(s) ?

Which, for example, would match 'on this day the man said...' but would not match 'on this day and every day there will be...'

In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:

    public bool IsMatchExtended(string textToTest, string extendedRegex)
    {
        int notPosition = extendedRegex.IndexOf('¬');

        // Just a normal regex:
        if (notPosition==-1)
            return Regex.IsMatch(textToTest, extendedRegex);

        // Use a positive (normal) regex and a negative one
        string positiveRegex = extendedRegex.Substring(0, notPosition);
        string negativeRegex = extendedRegex.Substring(notPosition + 1, extendedRegex.Length - notPosition - 1);

        return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
    }

Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
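
For what it's worth, the escaping concern could be handled with a small hand-written scanner instead of Substring(). The following is only a sketch under assumptions that are not in the question: the escape convention (a backslash before ¬ means a literal ¬), the class name ExtendedRegex and the Split helper are all made up for illustration. It sticks to constructs available in C# 3 / .NET 3.5.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ExtendedRegex
{
    // U+00AC NOT SIGN, the separator proposed in the question.
    private const char Not = '\u00AC';

    // Splits "positive¬negative" at the first unescaped ¬.
    // Assumed convention: "\¬" is a literal ¬; any other backslash pair
    // (\d, \\, ...) is copied through unchanged for the regex engine.
    public static void Split(string extended, out string positive, out string negative)
    {
        StringBuilder current = new StringBuilder();
        positive = null;
        negative = null;

        for (int i = 0; i < extended.Length; i++)
        {
            char c = extended[i];
            if (c == '\\' && i + 1 < extended.Length)
            {
                char next = extended[i + 1];
                if (next == Not)
                    current.Append(Not);            // drop the backslash, keep ¬
                else
                    current.Append(c).Append(next); // leave regex escapes intact
                i++; // the escaped character has been consumed
            }
            else if (c == Not && positive == null)
            {
                positive = current.ToString();
                current.Length = 0; // start collecting the negative part
            }
            else
            {
                current.Append(c); // includes any later ¬, kept as a literal
            }
        }

        if (positive == null)
            positive = current.ToString();
        else
            negative = current.ToString();
    }

    public static bool IsMatchExtended(string text, string extended)
    {
        string positive, negative;
        Split(extended, out positive, out negative);

        if (!Regex.IsMatch(text, positive))
            return false;
        // An absent or empty negative part places no restriction on the text.
        return string.IsNullOrEmpty(negative) || !Regex.IsMatch(text, negative);
    }

    public static void Main()
    {
        string pattern = "on (this|that|these) days?\u00AC(every|all) days?";
        Console.WriteLine(IsMatchExtended("on this day the man said", pattern));                // True
        Console.WriteLine(IsMatchExtended("on this day and every day there will be", pattern)); // False
        // "a\¬b" has no unescaped separator, so it is one plain regex for the text "a¬b".
        Console.WriteLine(IsMatchExtended("a\u00ACb", "a\\\u00ACb"));                           // True
    }
}
```

Because every backslash pair is consumed as a unit, an escaped backslash followed by ¬ (\\¬) still splits, which is probably what an administrator would expect.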

In writing this question I also came across this answer which suggests using something like this:

    ^(?=(?:(?!negative pattern).)*$).*?positive pattern

So I could just advise people to use a pattern like that, instead of my original plan, when they want to NOT match certain text.

Would that do the equivalent of my original plan? I think it's quite an expensive way to do it performance-wise, and since I'm sometimes parsing large HTML documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
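
One way to check the equivalence is to build a single lookahead pattern from the two parts and compare it with the two-regex test on a few inputs. This is a rough sketch, not a proof: the Combine helper and the class name are invented here, it uses a simpler leading negative lookahead rather than the exact pattern quoted above, and it assumes neither part relies on numbered backreferences (wrapping the parts shifts group numbers). [\s\S] is used instead of . so that newlines in HTML do not stop the scan.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadTranslation
{
    // Builds one pattern meaning:
    // "negative matches nowhere in the text AND positive matches somewhere".
    public static string Combine(string positive, string negative)
    {
        return @"\A(?![\s\S]*?(?:" + negative + @"))[\s\S]*?(?:" + positive + ")";
    }

    // The original plan: two separate regexes combined in code.
    public static bool TwoRegexes(string text, string positive, string negative)
    {
        return Regex.IsMatch(text, positive) && !Regex.IsMatch(text, negative);
    }

    public static void Main()
    {
        string positive = "on (this|that|these) days?";
        string negative = "(every|all) days?";
        string combined = Combine(positive, negative);

        string[] samples =
        {
            "on this day the man said",
            "on this day and every day there will be",
            "nothing relevant here",
            "every day\non that day"
        };

        foreach (string sample in samples)
        {
            bool viaTwo = TwoRegexes(sample, positive, negative);
            bool viaOne = Regex.IsMatch(sample, combined);
            Console.WriteLine("{0} / {1}", viaTwo, viaOne);
        }
    }
}
```

If the two columns ever disagree for a real pattern, that pattern probably depends on something the wrapping changes, such as group numbering.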

Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.

I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.

I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain pattern'.

Recommended answer

You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.

You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.

Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.

Also, the complicated regex you found in the other SO question can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds: (?!...), (?<!...), (?=...) and (?<=...).

Some examples

Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>

Given the following regexes, these are the results you will see:

    tr - match
    td - match
    ^td - no match
    ^tr - no match
    ^<tr - match
    ^<tr>.*</tr> - no match
    ^<tr.*>.*</tr> - match
    ^<tr.*>.*</tr>(?<=tr>) - match
    ^<tr.*>.*</tr>(?<!tr>) - no match
    ^<tr.*>.*</tr>(?<!Albatross) - match
    ^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
    ^(?!.*Albatross.*)<tr.*>.*</tr> - no match
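
If you want to reproduce these results yourself, a small console program along these lines should do it. It is a sketch only: the class name is arbitrary, and the eighth pattern is written with (?<=...), which is .NET's positive lookbehind syntax. The wildcard lookbehind in the eleventh pattern relies on .NET allowing arbitrary patterns inside a lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundExamples
{
    public const string Sample = "<tr valign='top'><td>Albatross</td></tr>";

    // The twelve patterns from the list, in the same order.
    public static readonly string[] Patterns =
    {
        @"tr",
        @"td",
        @"^td",
        @"^tr",
        @"^<tr",
        @"^<tr>.*</tr>",
        @"^<tr.*>.*</tr>",
        @"^<tr.*>.*</tr>(?<=tr>)",
        @"^<tr.*>.*</tr>(?<!tr>)",
        @"^<tr.*>.*</tr>(?<!Albatross)",
        @"^<tr.*>.*</tr>(?<!.*Albatross.*)",
        @"^(?!.*Albatross.*)<tr.*>.*</tr>"
    };

    public static void Main()
    {
        foreach (string pattern in Patterns)
        {
            bool isMatch = Regex.IsMatch(Sample, pattern);
            Console.WriteLine("{0} - {1}", pattern, isMatch ? "match" : "no match");
        }
    }
}
```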

Explanation

The first two match because the regex can apply anywhere in the sample (or test) string. The next two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr; it starts with a left angle bracket.

The fifth example matches because the test string starts with <tr. The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.

The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<=. It says: match the entire regex only if what immediately precedes the cursor in the test string matches what's in the parens, following the ?<=. In this case, that is tr>. After evaluating ^<tr.*>.*</tr>, the cursor is positioned at the end of the test string. Therefore tr> is matched against the end of the test string, which succeeds. The positive lookbehind evaluates to TRUE, so the overall regex matches.

The ninth example shows how to insert a negative lookbehind assertion, using ?<!. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what follows ?<! in the parens", which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. The pattern tr> does match the end of the string, but because this is a negative assertion it evaluates to FALSE, which means the 9th example is NOT a match.

The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what's in the parens", in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, the negative lookbehind evaluates to TRUE, which means the 10th example is a match.

The 11th example extends the negative lookbehind to include wildcards; in English, the result of the negative lookbehind is "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, so the negative lookbehind evaluates to FALSE, and the 11th regex does not match.

The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width - they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which means the overall regex fails to match, and evaluation of the regex against the test string stops there.

Example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In ex 12, the negative check is performed first, and evaluation stops immediately. In ex 11, the full regex is applied, and evaluates to TRUE, before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
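
To get a feel for that difference on your own data, something like the following timing loop may help. It is a crude sketch, not a proper benchmark: the iteration count, the class name and the Measure helper are arbitrary choices made here, and the numbers will vary with the machine, the runtime and above all the size of the input, so substitute one of your real HTML documents for the short sample string.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class LookaroundTiming
{
    public const string Sample = "<tr valign='top'><td>Albatross</td></tr>";

    // Runs IsMatch repeatedly and returns the elapsed wall-clock milliseconds.
    public static long Measure(Regex regex, string text, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            regex.IsMatch(text);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Example 11: positive part first, negative lookbehind checked last.
        Regex lookbehind = new Regex(@"^<tr.*>.*</tr>(?<!.*Albatross.*)");
        // Example 12: negative lookahead checked first.
        Regex lookahead = new Regex(@"^(?!.*Albatross.*)<tr.*>.*</tr>");

        const int iterations = 100000;

        // Warm-up pass so one-time setup costs are not counted.
        Measure(lookbehind, Sample, 1000);
        Measure(lookahead, Sample, 1000);

        Console.WriteLine("lookbehind (ex 11): {0} ms", Measure(lookbehind, Sample, iterations));
        Console.WriteLine("lookahead  (ex 12): {0} ms", Measure(lookahead, Sample, iterations));
    }
}
```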

For more on this stuff, read up at http://www.regular-expressions.info/

Or get a regex evaluator tool and try out some tests.

like this tool:

Source code and binaries