Hello!
I'm working on a regular expression in a .NET project to get a specific tag. I would like to match the entire DIV tag and its contents:
<html>
<head><title>Test</title></head>
<body>
<p>The first paragraph.</p>
<div id='super_special'>
<p>The Store paragraph</p>
</div>
</body>
</html>
Code:
Regex re = new Regex("(<div id='super_special'>.*?</div>)", RegexOptions.Multiline);
if (re.IsMatch(test))
Console.WriteLine("it matches");
else
Console.WriteLine("no match");
I want to match this:
<div id="super_special">
<p>Anything could go in here...doesn't matter. Let's get it all</p>
</div>
I thought . was supposed to match all characters, but it seems to be having trouble with the carriage returns. What is my regex missing?
Thanks.
Out of the box, without special modifiers, most regex implementations don't let . match past the end of a line. You should probably look in the documentation of the regex engine you're using for such a modifier.
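In .NET specifically, the modifier in question is RegexOptions.Singleline, which lets . match newline characters; RegexOptions.Multiline only changes how ^ and $ behave. A minimal sketch using the sample from the question (the class and method names are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in the question; only the options differ.
    private const string Pattern = "(<div id='super_special'>.*?</div>)";

    // Hypothetical helper: does the pattern match under the given options?
    public static bool MatchesWith(RegexOptions options, string html)
    {
        return new Regex(Pattern, options).IsMatch(html);
    }

    public static void Main()
    {
        string test =
            "<body>\n" +
            "<p>The first paragraph.</p>\n" +
            "<div id='super_special'>\n" +
            "<p>The Store paragraph</p>\n" +
            "</div>\n" +
            "</body>";

        // Multiline only affects ^ and $, so . still stops at each newline.
        Console.WriteLine(MatchesWith(RegexOptions.Multiline, test));  // False
        // Singleline lets . match \n as well, so the whole DIV is matched.
        Console.WriteLine(MatchesWith(RegexOptions.Singleline, test)); // True
    }
}
```

If you also need ^ and $ to work per line, the two options can be combined with |.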
One other piece of advice: beware of greed! Traditionally, regex quantifiers are greedy, which means that a regex using a plain .* would probably match all of this:
<div id="super_special">
I'm the wanted div!
</div>
<div id="not_special">
I'm not wanted, but I've been caught too :(
</div>
You should check for a non-greedy modifier, so that your regex stops matching at the first occurrence of </div>, not at the last one.
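In .NET, the non-greedy form is written by putting ? after the quantifier (.*? instead of .*). A small sketch of the difference on the two-DIV example above; the class and helper names are hypothetical, and RegexOptions.Singleline is assumed so that . can cross the line breaks:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns the text of the first match, or "" if none.
    public static string FirstMatch(string pattern, string html)
    {
        // Singleline so that . also matches the newlines inside the DIVs.
        return Regex.Match(html, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string html =
            "<div id=\"super_special\">\nI'm the wanted div!\n</div>\n" +
            "<div id=\"not_special\">\nI'm not wanted, but I've been caught too :(\n</div>";

        string greedy = FirstMatch("<div id=\"super_special\">.*</div>", html);
        string lazy = FirstMatch("<div id=\"super_special\">.*?</div>", html);

        // Greedy .* runs on to the last </div> and swallows the second DIV.
        Console.WriteLine(greedy.Contains("not_special")); // True
        // Lazy .*? stops at the first </div>.
        Console.WriteLine(lazy.Contains("not_special"));   // False
    }
}
```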
Also, as others have said, consider using an HTML parser instead of regexes. It will save you a lot of headaches.
Edit: even a non-greedy regex won't work as expected if <div>s are nested! Another reason to consider using an HTML parser.
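To make the nesting problem concrete, here is a small sketch with a made-up nested input (the inner DIV and the names below are illustrative, not from the question). The non-greedy match ends at the inner closing tag and drops the rest of the outer DIV:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDemo
{
    // Hypothetical helper: non-greedy match of the special DIV, with . matching newlines.
    public static string LazyMatch(string html)
    {
        return Regex.Match(html, "<div id='super_special'>.*?</div>", RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string html =
            "<div id='super_special'>\n" +
            "<div class='inner'>nested</div>\n" +
            "<p>still inside super_special</p>\n" +
            "</div>";

        string match = LazyMatch(html);

        // The match includes the inner DIV...
        Console.WriteLine(match.Contains("nested"));       // True
        // ...but ends at the inner </div>, so the trailing paragraph is lost.
        Console.WriteLine(match.Contains("still inside")); // False
    }
}
```

The pattern has no notion of which </div> closes which <div>, whereas a parser tracks that pairing for you.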