Why is compiled regex performance slower than interpreted regex?

2023-09-02 23:50:55 Author: 东京雨季

I ran into this article:

Performance: Compiled vs. Interpreted Regular Expressions. I modified the sample code to compile 1000 regexes and then run each one 500 times to take advantage of precompilation. Even in that case, the interpreted regexes ran 4 times faster!

This means the RegexOptions.Compiled option is completely useless, or actually even worse, slower! The big difference was due to JIT. After accounting for JIT, the compiled regex in the following code still performs a little slower, which doesn't make sense to me, but @Jim's answer provides a much cleaner version that works as expected.

Can anyone explain why this is the case?

Code, taken and modified from the blog post:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace RegExTester
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime startTime = DateTime.Now;

            for (int i = 0; i < 1000; i++)
            {
                CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());    
            }


            double msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
            Console.WriteLine("Full Run: " + msTaken);


            startTime = DateTime.Now;

            for (int i = 0; i < 1000; i++)
            {
                CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());
            }


            msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
            Console.WriteLine("Full Run: " + msTaken);

            Console.ReadLine();

        }


        private static List<Regex> _expressions;
        private static object _SyncRoot = new object();

        private static List<Regex> GetExpressions()
        {
            if (_expressions != null)
                return _expressions;

            lock (_SyncRoot)
            {
                if (_expressions == null)
                {
                    DateTime startTime = DateTime.Now;

                    List<Regex> tempExpressions = new List<Regex>();
                    string regExPattern =
                        @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@{0}$";

                    for (int i = 0; i < 2000; i++)
                    {
                        tempExpressions.Add(new Regex(
                            string.Format(regExPattern,
                            Regex.Escape("domain" + i.ToString() + "." +
                            (i % 3 == 0 ? ".com" : ".net"))),
                            RegexOptions.IgnoreCase));//  | RegexOptions.Compiled
                    }

                    _expressions = new List<Regex>(tempExpressions);
                    DateTime endTime = DateTime.Now;
                    double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
                    Console.WriteLine("Init:" + msTaken);
                }
            }

            return _expressions;
        }

        static List<Regex> expressions = GetExpressions();

        private static void CheckForMatches(string text)
        {

            DateTime startTime = DateTime.Now;


            foreach (Regex e in expressions)
            {
                bool isMatch = e.IsMatch(text);
            }


            DateTime endTime = DateTime.Now;
            //double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
            //Console.WriteLine("Run: " + msTaken);

        }
    }
}

Solution

Compiled regular expressions match faster when used as intended. As others have pointed out, the idea is to compile them once and use them many times. The construction and initialization time is then amortized over those many runs.
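The compile-once, reuse-many-times idea can be sketched roughly as follows. This is only an illustration: the class name `CompiledRegexReuse`, the method `IsDomain0Address`, and the simplified single-dot pattern are assumptions made for the example and do not come from the question or the blog post.

```csharp
using System;
using System.Text.RegularExpressions;

static class CompiledRegexReuse
{
    // Built once per process, so the one-off compilation cost is paid a single time.
    // Hypothetical pattern: a simplified variant of the one used in the test code.
    private static readonly Regex Domain0Address = new Regex(
        @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain0\.com$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Every call reuses the same compiled instance instead of constructing a new Regex.
    public static bool IsDomain0Address(string text)
    {
        return Domain0Address.IsMatch(text);
    }

    static void Main()
    {
        Console.WriteLine(IsDomain0Address("address@domain0.com")); // prints "True"
        Console.WriteLine(IsDomain0Address("address@domain1.com")); // prints "False"
    }
}
```

Constructing the regex inside `IsDomain0Address` instead would pay the compilation cost on every call, which is the opposite of the intended usage.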

I created a much simpler test that will show you that compiled regular expressions are unquestionably faster than non-compiled ones.

    using System;
    using System.Diagnostics;               // Stopwatch
    using System.Text.RegularExpressions;

    class Program
    {
        const int NumIterations = 1000;
        const string TestString = "some random text with email address, address@domain200.com";
        // Same pattern the question's code generates for i = 0 (note the doubled dot).
        const string Pattern = "^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain0\\.\\.com$";
        private static Regex NormalRegex = new Regex(Pattern, RegexOptions.IgnoreCase);
        private static Regex CompiledRegex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static Regex DummyRegex = new Regex("^.$");

        static void Main(string[] args)
        {
            // Times `count` calls to IsMatch on regex `r` and prints the elapsed time.
            var DoTest = new Action<string, Regex, int>((s, r, count) =>
                {
                    Console.Write("Testing {0} ... ", s);
                    Stopwatch sw = Stopwatch.StartNew();
                    for (int i = 0; i < count; ++i)
                    {
                        bool isMatch = r.IsMatch(TestString + i.ToString());
                    }
                    sw.Stop();
                    Console.WriteLine("{0:N0} ms", sw.ElapsedMilliseconds);
                });

            // Make sure that DoTest is JITed
            DoTest("Dummy", DummyRegex, 1);
            // The single-iteration runs isolate each regex's first-use cost.
            DoTest("Normal first time", NormalRegex, 1);
            DoTest("Normal Regex", NormalRegex, NumIterations);
            DoTest("Compiled first time", CompiledRegex, 1);
            DoTest("Compiled", CompiledRegex, NumIterations);

            Console.WriteLine();
            Console.Write("Done. Press Enter:");
            Console.ReadLine();
        }
    }

Setting NumIterations to 500 gives me this:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 1 ms
Testing Compiled first time ... 13 ms
Testing Compiled ... 1 ms

With 5 million iterations, I get:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 17,232 ms
Testing Compiled first time ... 17 ms
Testing Compiled ... 15,299 ms

Here you see that the compiled regular expression is at least 10% faster than the non-compiled version (15,299 ms versus 17,232 ms).

It's interesting to note that if you remove the RegexOptions.IgnoreCase from your regular expression, the results from 5 million iterations are even more striking:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 12,869 ms
Testing Compiled first time ... 14 ms
Testing Compiled ... 8,332 ms

Here, the compiled regular expression is 35% faster than the non-compiled one (8,332 ms versus 12,869 ms).
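To get a rough feel for how many matches it takes before compilation pays for itself, the figures from the two 5-million-iteration runs above can be plugged into a simple break-even calculation. This is a back-of-the-envelope sketch, not part of the original test. The helper `BreakEvenMatches` is hypothetical, and it treats the "Compiled first time" figure as the only one-off cost. That understates the true cost, because the test does not time the construction of the static `Regex` fields.

```csharp
using System;

static class BreakEvenEstimate
{
    // Number of matches after which the one-off cost of the compiled regex
    // is repaid by its per-match saving. All times are in milliseconds.
    // The result is only meaningful when the compiled run is the faster one.
    public static double BreakEvenMatches(
        double interpretedTotalMs, double compiledTotalMs,
        long iterations, double oneOffCostMs)
    {
        double savingPerMatchMs = (interpretedTotalMs - compiledTotalMs) / iterations;
        return oneOffCostMs / savingPerMatchMs;
    }

    static void Main()
    {
        // Figures copied from the two 5-million-iteration runs quoted above.
        double withIgnoreCase = BreakEvenMatches(17232, 15299, 5000000, 17);
        double withoutIgnoreCase = BreakEvenMatches(12869, 8332, 5000000, 14);

        Console.WriteLine("With IgnoreCase:    ~{0:N0} matches", withIgnoreCase);
        Console.WriteLine("Without IgnoreCase: ~{0:N0} matches", withoutIgnoreCase);
    }
}
```

Under these assumptions both estimates land in the tens of thousands of matches. That fits the 500-iteration run, where the compiled regex's 13 ms first-use cost outweighs everything else.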

In my opinion, the blog post you reference is simply a flawed test.
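One way to check whether a benchmark is dominated by setup rather than by matching is to time the two phases separately. The sketch below is an illustration under assumptions, not the blog's or the answer's code: the class `RegexCostSplit`, the method `Measure`, the single-dot pattern, and the iteration count are all made up for the example.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class RegexCostSplit
{
    // Returns { setupMs, matchingMs } for one regex built with the given options.
    public static double[] Measure(string pattern, RegexOptions options, string input, int matches)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Regex regex = new Regex(pattern, options);
        regex.IsMatch(input); // first call, so any first-use cost is counted as setup
        double setupMs = sw.Elapsed.TotalMilliseconds;

        sw.Restart();
        for (int i = 0; i < matches; i++)
        {
            regex.IsMatch(input);
        }
        double matchingMs = sw.Elapsed.TotalMilliseconds;

        return new[] { setupMs, matchingMs };
    }

    static void Main()
    {
        // Hypothetical pattern and input, chosen only for illustration.
        const string pattern = @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain0\.com$";
        const string input = "address@domain0.com";

        RegexOptions[] variants =
        {
            RegexOptions.IgnoreCase,
            RegexOptions.IgnoreCase | RegexOptions.Compiled
        };

        foreach (RegexOptions options in variants)
        {
            double[] ms = Measure(pattern, options, input, 100000);
            Console.WriteLine("{0}: setup {1:F2} ms, matching {2:F2} ms", options, ms[0], ms[1]);
        }
    }
}
```

If the setup figure for the compiled variant is large compared with its matching figure, a benchmark that builds many such regexes and uses each only briefly will mostly be measuring compilation.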