I want to find the href of an anchor tag, so I have used this regex:
<a\s*[^>]*\s*href\s*\=\s*([^(\s*|\>)]*)\s*[^>]*>\s*Text\s*<\/a>
Options = IgnoreCase + Singleline
Example:
<a href="/abc/xzy/pqr.com" class="m">Text</a>
So Group[1]="/abc/xzy/pqr.com"
But if the content is like this:
<a href="/abc/xzy/ //Contains new line
pqr.com" class="m">Text</a>
so Group[1]="/abc/xzy/
So I want to know how to get "/abc/xzy/pqr.com" when the content contains a new line (\r\n).
Your capture group is a bit weird: [^(\s*|\>)]* is a negated character class, so it matches any character that is not an opening parenthesis (, nor whitespace \s, nor an asterisk *, nor a pipe |, and so on. Inside the brackets, the parentheses, the asterisk and the pipe are literal characters, not grouping, repetition or alternation.
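To make that concrete, here is a small sketch in Python (the question does not say which regex engine is used, so Python's re module is an assumption) that tests the character class one character at a time. The sample characters are picked only for illustration.

```python
import re

# The capture group from the question, reduced to a single character class.
original_class = re.compile(r"[^(\s*|\>)]")

# Inside [...] these are literal characters (or the \s shorthand), not operators,
# so each of them is excluded by the negated class.
for ch in ["(", ")", "*", "|", ">", " ", "\r", "\n", "a", "/", '"']:
    accepted = original_class.fullmatch(ch) is not None
    print(repr(ch), "accepted" if accepted else "rejected")

# Characters the class refuses to match.
rejected = [ch for ch in "()*|> \r\n" if original_class.fullmatch(ch) is None]
```

Because \r and \n count as whitespace, the class stops at the line break, which is why Group[1] ends at "/abc/xzy/ in the failing example.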
What you can do, however, is put quotes before and after the capture group:
<a\s*[^>]*\s*href\s*\=\s*"([^(\s*|\>)]*)"\s*[^>]*>\s*Text\s*<\/a>
                         ^              ^
And then change the character class to [^"] (anything that is not a quote):
<a\s*[^>]*\s*href\s*\=\s*"([^"]*)"\s*[^>]*>\s*Text\s*<\/a>
                           ^^^^
regex101 demo.
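A quick way to check the quoted version is the following sketch, again assuming Python's re module, with re.IGNORECASE | re.DOTALL standing in for the IgnoreCase + Singleline options. Note that the captured value still contains the line break, so the last step, which strips \r and \n, is an addition of this sketch rather than part of the pattern.

```python
import re

# Corrected pattern: the href value is whatever sits between the double quotes.
pattern = re.compile(
    r'<a\s*[^>]*\s*href\s*\=\s*"([^"]*)"\s*[^>]*>\s*Text\s*<\/a>',
    re.IGNORECASE | re.DOTALL,
)

# The failing input from the question, with the line break inside the attribute.
html = '<a href="/abc/xzy/\r\npqr.com" class="m">Text</a>'

match = pattern.search(html)
raw_href = match.group(1)                 # still contains the \r\n
href = re.sub(r"[\r\n]+", "", raw_href)   # assumption: the line break should be dropped
print(href)
```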
That said, it would be better to use a proper HTML parser instead of regex. Writing a suitable regex is tedious because it is easy to overlook some of the many different scenarios, but if you're certain of how your data comes through, regex might be a quick way to get what you need.
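As a rough illustration of the parser route, here is a minimal sketch using Python's standard html.parser module (the choice of Python and the names AnchorHrefCollector and find_hrefs are assumptions made for this example). It collects the href of every anchor whose text is "Text" and strips embedded line breaks from the value.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collects href values of <a> tags whose text equals a target string."""

    def __init__(self, target_text):
        super().__init__()
        self.target_text = target_text
        self.hrefs = []
        self._in_anchor = False
        self._current_href = None   # href of the <a> that is currently open
        self._text_parts = []       # text seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._text_parts).strip()
            if text == self.target_text and self._current_href is not None:
                # Drop line breaks that were embedded in the attribute value.
                cleaned = self._current_href.replace("\r", "").replace("\n", "")
                self.hrefs.append(cleaned)
            self._in_anchor = False


def find_hrefs(html, target_text="Text"):
    parser = AnchorHrefCollector(target_text)
    parser.feed(html)
    parser.close()
    return parser.hrefs


print(find_hrefs('<a href="/abc/xzy/\r\npqr.com" class="m">Text</a>'))
```

The parser deals with quoting and attribute order itself, so the only logic left to write is the comparison of the link text.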
If you also want to handle single quotes, or no quotes at all in some cases, you might try this instead:
<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>
Updated regex101.
This regex uses (?:[^ ]|[\n\r])+ for the capture group instead, which accepts non-spaces and newlines (and carriage returns, just in case). Note that \s matches spaces, tabs, newlines and form feeds, which is why it could not be used in the capture group here.
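One thing to keep in mind with this variant is that the capture group includes the surrounding quotes when they are present. The sketch below (Python's re module is assumed, and extract_href is a hypothetical helper name) trims the quotes and the line break after matching.

```python
import re

# Quote-agnostic pattern from the answer.
pattern = re.compile(
    r'<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_href(html):
    """Return the cleaned href of the first matching anchor, or None."""
    match = pattern.search(html)
    if match is None:
        return None
    value = match.group(1)
    value = value.strip("\"'")              # remove surrounding quotes, if any
    return re.sub(r"[\r\n]+", "", value)    # assumption: line breaks should be dropped


print(extract_href('<a href="/abc/xzy/\r\npqr.com" class="m">Text</a>'))
print(extract_href("<a href='/abc/xzy/pqr.com' class=\"m\">Text</a>"))
print(extract_href('<a href=/abc/xzy/pqr.com class="m">Text</a>'))
```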