Why are compiled C# regular expressions faster than the equivalent string methods?

2023-09-03 05:45:09 Author: 不思量自难忘°

Every time I have to do simple containment or replacement operations on strings, where the term that I'm searching for is a fixed value, I find that if I take my sample input and do some profiling on it, using a compiled regular expression is nearly* always faster than using the equivalent method from the String class.

I've tried comparing a variety of methods (hs is the "haystack" to search, ndl is the "needle" to search for, and repl is the replacement value; regex is always created with the RegexOptions.Compiled option):

hs.Replace( ndl, repl ) vs regex.Replace( hs, repl )
hs.Contains( ndl ) vs regex.IsMatch( hs )
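For reference, a comparison of this kind can be sketched roughly as follows. This is a minimal sketch, not the profiling code used for the question: the class name SearchTiming and its helpers are made up for illustration, and Stopwatch timing over a fixed iteration count is only a crude stand-in for a real profiler. It assumes the needle should be matched literally, so it is passed through Regex.Escape, and that "$" in the replacement is doubled so regex.Replace does not treat it as a substitution token.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class SearchTiming
{
    // Runs `action` once as a warm-up, then times `iterations` runs in milliseconds.
    public static long Time(Action action, int iterations)
    {
        action(); // warm-up so one-time JIT cost is not measured
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Builds a compiled regex that matches `ndl` as literal text.
    public static Regex LiteralRegex(string ndl) =>
        new Regex(Regex.Escape(ndl), RegexOptions.Compiled);

    // Prints timings for the two pairs of operations compared in the question.
    public static void Compare(string hs, string ndl, string repl, int iterations)
    {
        Regex regex = LiteralRegex(ndl);            // constructed once, outside the timed loops
        string regexRepl = repl.Replace("$", "$$"); // keep "$" literal in the regex replacement

        Console.WriteLine($"hs.Contains:   {Time(() => hs.Contains(ndl), iterations)} ms");
        Console.WriteLine($"regex.IsMatch: {Time(() => regex.IsMatch(hs), iterations)} ms");
        Console.WriteLine($"hs.Replace:    {Time(() => hs.Replace(ndl, repl), iterations)} ms");
        Console.WriteLine($"regex.Replace: {Time(() => regex.Replace(hs, regexRepl), iterations)} ms");
    }
}
```

Calling SearchTiming.Compare(hs, ndl, repl, 100000) with your own sample input prints the four timings. The numbers depend heavily on the runtime version and the input, so treat them as indicative only.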

I've found quite a few discussions focusing on which of the two techniques is faster (1, 2, 3, and loads of others), but those discussions always seem to end in one of two pieces of advice:

- Use the string version for simple operations and regex for complex operations (which, from a raw performance perspective, doesn't even seem to be necessarily a good idea), or
- Run a test and compare the two (and for equivalent tests, the regex version seems to always perform better).

I don't understand how this can possibly be the case: how does the regex engine compare any two strings for substring matches faster than the equivalent string version? This seems to hold true whether the search space is very small or very large, whether the search term is short or long, and whether the search term occurs early or late in the search space.

So, why are regular expressions faster?

* In fact, the only case where I've managed to show that the string version is faster than a compiled regex is when searching an empty string! In every other case, from single-character strings to very long strings, the compiled regex is faster than the equivalent string method.

Update: Added a clause to clarify that I'm looking at cases where the search term is known at compile time. For dynamic or one-time operations, the overhead of compiling the regular expression will tend to skew the results in favor of the string methods.
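The "known at compile time" case can be sketched as below: the regex is built once in a static readonly field, so its construction and compilation cost is paid a single time and then amortized over every call. This is only an illustrative sketch. The class FixedTermSearch, the constant "needle", and the method names are hypothetical and not from the original post.

```csharp
using System.Text.RegularExpressions;

// Hypothetical example class; the fixed search term "needle" is made up.
public static class FixedTermSearch
{
    private const string Needle = "needle";

    // Built once per process; Regex.Escape makes the term match literally.
    private static readonly Regex NeedleRegex =
        new Regex(Regex.Escape(Needle), RegexOptions.Compiled);

    // Fixed term: reuse the cached compiled regex on every call.
    public static bool ContainsNeedle(string hs) => NeedleRegex.IsMatch(hs);

    // "$" is doubled so the replacement text is inserted literally,
    // as String.Replace would do.
    public static string ReplaceNeedle(string hs, string repl) =>
        NeedleRegex.Replace(hs, repl.Replace("$", "$$"));

    // Dynamic or one-time term: no regex is constructed, so there is no
    // compilation overhead to pay back.
    public static bool ContainsOnce(string hs, string ndl) => hs.Contains(ndl);
}
```

Constructing a new Regex with RegexOptions.Compiled inside a hot loop would put the compilation cost back into every iteration, which is the skew the update describes.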

Solution

I don't understand how this can possibly be the case: how does the regex engine compare any two strings for substring matches faster than the equivalent string version?

I can think of two reasons:

1. The regex is using some smart algorithm like Boyer-Moore, which can skip ahead and needs only about O(N/M) comparisons in the best case (N is the haystack length, M is the needle length), while the simple string operation compares the needle to each position in the haystack (O(N*M) in the worst case).
2. They're not really doing the same thing. For example, one might do culture-invariant matching while the other does culture-dependent matching, which might make a performance difference.
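To make the two guesses concrete, here is a rough sketch. It does not show how String or Regex are actually implemented in .NET. It contrasts a naive scan with the Horspool simplification of Boyer-Moore (swapped in here because it is shorter than full Boyer-Moore), and shows how an ordinal and a culture-sensitive search are requested explicitly through StringComparison. The class SubstringSearch and all method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration class; not the framework's implementation.
public static class SubstringSearch
{
    // Naive scan: try every start position. Worst case O(N*M).
    public static int NaiveIndexOf(string hs, string ndl)
    {
        for (int i = 0; i + ndl.Length <= hs.Length; i++)
        {
            int j = 0;
            while (j < ndl.Length && hs[i + j] == ndl[j]) j++;
            if (j == ndl.Length) return i;
        }
        return -1;
    }

    // Boyer-Moore-Horspool: compare right to left, then skip ahead based on
    // the haystack character aligned with the needle's last position.
    public static int HorspoolIndexOf(string hs, string ndl)
    {
        int m = ndl.Length;
        if (m == 0) return 0;

        // Shift table: distance from a character's last occurrence
        // (excluding the final needle character) to the needle's end.
        var shift = new Dictionary<char, int>();
        for (int k = 0; k < m - 1; k++) shift[ndl[k]] = m - 1 - k;

        int i = 0;
        while (i + m <= hs.Length)
        {
            int j = m - 1;
            while (j >= 0 && hs[i + j] == ndl[j]) j--;
            if (j < 0) return i;

            // Characters not in the needle allow a full jump of m positions.
            i += shift.TryGetValue(hs[i + m - 1], out int s) ? s : m;
        }
        return -1;
    }

    // Second guess: the comparison rules can be chosen explicitly.
    public static bool ContainsOrdinal(string hs, string ndl) =>
        hs.IndexOf(ndl, StringComparison.Ordinal) >= 0;

    public static bool ContainsCurrentCulture(string hs, string ndl) =>
        hs.IndexOf(ndl, StringComparison.CurrentCulture) >= 0;
}
```

When the needle is long and most haystack characters do not occur in it, the Horspool loop jumps nearly M positions at a time, which is where the roughly O(N/M) best case comes from. Timing ContainsOrdinal against ContainsCurrentCulture on the same input is one way to check how much the comparison rules alone contribute.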