Regular expression generator/reducer?

2023-09-10 23:15:00 Author: 花落葬相思

A colleague posed an interesting question about an operational pain point we currently have, and I'm curious whether there's anything out there (utility/library/algorithm) that might help automate this.

Say you have a list of literal values (in our case, they are URLs). What we want to do is, based on this list, come up with a single regex that matches all of those literal items.

So, if my list is:

http://www.abc.com
http://www.abc.com/subdir
http://foo.abc.com

The simplest answer is

^(http://www.abc.com|http://www.abc.com/subdir|http://foo.abc.com)$

but this gets large for lots of data, and we have a length limit we're trying to stay under.
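For reference, the simple alternation can be generated rather than typed by hand. The sketch below is only an illustration: the helper name `naive_alternation` is made up here, and it escapes each literal with `re.escape` so that characters such as `.` match only themselves. It also makes the growth problem easy to see, since the pattern length is roughly the sum of the lengths of all inputs.

```python
import re

def naive_alternation(values):
    """Join literal values into one anchored alternation.

    Hypothetical helper for illustration: every value is escaped so regex
    metacharacters (such as '.') are matched literally.
    """
    return "^(?:" + "|".join(re.escape(v) for v in values) + ")$"

urls = [
    "http://www.abc.com",
    "http://www.abc.com/subdir",
    "http://foo.abc.com",
]

pattern = naive_alternation(urls)
# The length grows with every value added to the list.
print(len(pattern), pattern)
```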

Currently we write the regexes by hand, but this doesn't scale well, nor is it a great use of anyone's time. Is there a more automated way of decomposing the source data to come up with a length-optimal regex that matches all of the source values?

Recommended answer

The Aho-Corasick matching algorithm constructs a finite automaton to match multiple strings. You could convert the automaton to its equivalent regex, but it is simpler to use the automaton directly (this is what the algorithm does).
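If a regex is still required, one way to approximate the conversion is to build a trie of the literals (essentially the goto structure of an Aho-Corasick automaton, without the failure links, which are not needed for anchored whole-string matching) and emit a pattern that factors out shared prefixes. The sketch below is an assumption-laden illustration, not the full algorithm: the names `build_trie`, `trie_to_regex`, and `literals_to_regex` are invented here, it only merges common prefixes (not common suffixes), and so the result is usually shorter than the plain alternation but not guaranteed to be length-optimal.

```python
import re

END = ""  # marker key meaning "a literal ends at this node"

def build_trie(words):
    """Build a nested-dict trie; each key is one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def trie_to_regex(node):
    """Turn a trie node into a regex fragment with shared prefixes factored out.

    Recursive, so extremely long literals could hit Python's recursion limit.
    """
    if END in node and len(node) == 1:
        return ""  # leaf: nothing left to match
    optional = END in node  # a literal ends here, but longer ones continue
    alternatives = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(k for k in node if k != END)
    ]
    if len(alternatives) == 1 and not optional:
        return alternatives[0]  # plain chain, no group needed
    group = "(?:" + "|".join(alternatives) + ")"
    return group + "?" if optional else group

def literals_to_regex(words):
    """Return an anchored regex matching exactly the given literals."""
    if not words:
        raise ValueError("need at least one literal")
    return "^" + trie_to_regex(build_trie(words)) + "$"

urls = [
    "http://www.abc.com",
    "http://www.abc.com/subdir",
    "http://foo.abc.com",
]
compact = literals_to_regex(urls)
print(len(compact), compact)
```

For the three example URLs, the shared `http://` prefix is emitted once and `/subdir` becomes an optional group, so the pattern is shorter than the plain alternation. Factoring shared suffixes as well (for example `.abc.com`) would need a minimized automaton rather than a trie.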