Regular expression to find an anchor tag that includes a new line in C# .NET (tags: regex, .NET)

2023-09-06 15:20:27 Author: 最佳搭档

I want to find the href of an anchor tag, so I used this regex:

 <a\s*[^>]*\s*href\s*\=\s*([^(\s*|\>)]*)\s*[^>]*>\s*Text\s*<\/a>
 Options = IgnoreCase + Singleline

Example

    <a href="/abc/xzy/pqr.com" class="m">Text</a>
So Group[1]="/abc/xzy/pqr.com"

But if the content looks like this:

     <a href="/abc/xzy/                     //Contains new line
    pqr.com" class="m">Text</a>  


so Group[1]="/abc/xzy/

So I want to know how to get "/abc/xzy/pqr.com" when the content contains a new line (\r\n).

Recommended answer

Your capture group is a bit odd: [^(\s*|\>)]* is a character class, so it matches any character that is not a (, a whitespace character (\s), an asterisk *, and so on.
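To make that concrete, here is a minimal C# sketch that tests single characters against the class on its own. The names `CharClassDemo` and `Accepts` are made up for this illustration and are not part of the question or the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // The class from the question's capture group, taken on its own.
    // Inside [...], the characters ( * | ) are literals, not grouping,
    // repetition or alternation.
    public const string QuestionClass = @"[^(\s*|\>)]";

    // True when the single character c is accepted by the class.
    public static bool Accepts(char c) =>
        Regex.IsMatch(c.ToString(), @"\A" + QuestionClass + @"\z");

    public static void Main()
    {
        char[] samples = { 'a', '/', '"', '(', '*', '|', '>', ')', ' ', '\r', '\n' };
        foreach (char c in samples)
        {
            // Print the code point so whitespace characters stay readable.
            Console.WriteLine($"U+{(int)c:X4} accepted: {Accepts(c)}");
        }
    }
}
```

Letters, slashes and double quotes are accepted, while (, *, |, >, ) and every whitespace character are rejected. Because \r and \n are whitespace, the capture stops at the line break.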

What you can do, however, is put quotes before and after the capture group:

<a\s*[^>]*\s*href\s*\=\s*"([^(\s*|\>)]*)"\s*[^>]*>\s*Text\s*<\/a>
                          ^              ^

Then change the character class to [^"] (no quotes):

<a\s*[^>]*\s*href\s*\=\s*"([^"]*)"\s*[^>]*>\s*Text\s*<\/a>
                            ^^^^

regex101 demo.
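As a rough C# sketch of how the quoted pattern could be used: the class `HrefFinder` and the method `TryExtract` are hypothetical names. Stripping whitespace from the captured value is an extra step that the answer does not mention. It rests on the assumption that any whitespace inside the href is only there because of the line break.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // Pattern from the answer: the value is whatever sits between the double quotes.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\s*[^>]*\s*href\s*\=\s*""([^""]*)""\s*[^>]*>\s*Text\s*<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns true and sets href when an anchor matches; otherwise href is "".
    public static bool TryExtract(string html, out string href)
    {
        Match m = AnchorHref.Match(html);
        // Assumption: whitespace inside the value is unwanted, so remove all of it.
        href = m.Success ? Regex.Replace(m.Groups[1].Value, @"\s+", "") : "";
        return m.Success;
    }

    public static void Main()
    {
        string html = "<a href=\"/abc/xzy/\r\npqr.com\" class=\"m\">Text</a>";
        if (TryExtract(html, out string href))
        {
            Console.WriteLine(href); // prints /abc/xzy/pqr.com
        }
    }
}
```

Without the whitespace-stripping step, Group[1] still contains the \r\n, because [^"]* matches line breaks like any other character.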

That said, it would be better to use a proper HTML parser instead of a regex. Building a suitable regex is tedious because it is easy to overlook many different scenarios, but if you are certain of how your data comes through, a regex can be a quick way to get what you need.

If you also want to handle single quotes, or no quotes at all in some cases, you might try this instead:

<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>

Updated regex101.

This regex uses (?:[^ ]|[\n\r])+ for the capture instead, which accepts non-space characters and newlines (and carriage returns, just in case). Note that \s matches spaces, tabs, newlines, carriage returns and form feeds, which is why the original character class stopped at the line break.
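The looser pattern could be tried along the following lines. This is only a sketch: `LooseHrefFinder` and `TryExtract` are hypothetical names. The cleanup of the captured value (removing whitespace, then trimming surrounding quote characters) is an assumption added here, since this pattern captures the quotes as part of the group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LooseHrefFinder
{
    // Pattern from the answer: the value runs up to the first space character.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns true and sets href when an anchor matches; otherwise href is "".
    public static bool TryExtract(string html, out string href)
    {
        Match m = AnchorHref.Match(html);
        // Assumption: drop line breaks first, then any surrounding quote characters.
        href = m.Success
            ? Regex.Replace(m.Groups[1].Value, @"\s+", "").Trim('"', '\'')
            : "";
        return m.Success;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href=\"/abc/xzy/\r\npqr.com\" class=\"m\">Text</a>", // double quotes, line break
            "<a href='/abc/xzy/pqr.com' class='m'>Text</a>",         // single quotes
            "<a href=/abc/xzy/pqr.com class=m>Text</a>",             // no quotes
        };
        foreach (string html in samples)
        {
            if (TryExtract(html, out string href))
            {
                Console.WriteLine(href); // prints /abc/xzy/pqr.com each time
            }
        }
    }
}
```

One limitation follows from the pattern itself. If the line break inside the value is followed by indentation spaces, the capture ends at the first of those spaces. The rest of the value is then consumed by the later [^>]*, so the result is truncated.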