Regex for URLs, including the query string

2023-09-03 01:31:33 Author: 他葬爱我葬忆

I thought this would be a simple Google search, but apparently not. What regex can I use in C# to extract a URL, including any query string, from a larger text? I have spent a lot of time on this and found many examples that don't include the query string. I can't use System.Uri, because that assumes you already have the URL; I need to find it in surrounding text.

Recommended answer

This should get just about anything (feel free to add additional protocols):

@"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*"

The real difficulty is finding the end. As is, this pattern relies on finding an invalid character. That would be anything other than letters, numbers, hyphen or period before the end of the domain name, or anything other than those plus forward slash (/), question mark (?), ampersand (&), equals sign (=), semicolon (;), plus sign (+), exclamation point (!), apostrophe/single quote ('), open/close parentheses, asterisk (*), underscore (_), tilde (~), or percent sign (%) after the domain name.
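
As a rough illustration of how the pattern could be applied to a larger text, here is a minimal sketch using Regex.Matches. The class name UrlFinder, the method name FindUrls, and the sample sentence are made up for this example; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // The pattern from the answer, unchanged.
    private static readonly Regex UrlPattern = new Regex(
        @"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Returns every substring of text that the pattern matches, in order of appearance.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

Calling UrlFinder.FindUrls("See http://example.com/search?q=regex&page=2 for details") should yield a single entry containing the whole URL with its query string, because the match only stops at the space, which is the first character outside the allowed set.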

Note that this would allow invalid URLs like

http://../

And it would pick up stuff after a URL, such as in this string:

Maybe you should try http://www.google.com.

Where "http://www.google.com." (with the trailing period) would be matched.
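
One possible workaround for the trailing-period case is to trim likely sentence punctuation from the end of each match after the fact. The sketch below is only that, a sketch: the class name UrlTrimmer, the method name FirstUrl, and the choice of characters to trim are assumptions, and trimming will damage any URL that legitimately ends in one of those characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlTrimmer
{
    // The pattern from the answer, unchanged.
    private static readonly Regex UrlPattern = new Regex(
        @"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Assumption: these characters are treated as sentence punctuation
    // when they appear at the very end of a match.
    private static readonly char[] TrailingPunctuation = { '.', ';', '!', '?', '\'' };

    // Returns the first match with trailing punctuation removed,
    // or an empty string when the text contains no match.
    public static string FirstUrl(string text)
    {
        Match match = UrlPattern.Match(text);
        return match.Success ? match.Value.TrimEnd(TrailingPunctuation) : string.Empty;
    }
}
```

With the sentence above as input, the raw match ends in a period, and TrimEnd removes it, leaving http://www.google.com.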

It would also miss URLs that don't begin with a protocol specification (specifically, one of the protocols within the first set of parentheses). For instance, it would miss the URL in this string:

Maybe you should try www.google.com.
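
If bare www. hosts matter, one option is to accept a literal "www." as an alternative to the protocol prefix. This variant is not part of the original answer; the class name LooseUrlFinder and the method name FindFirst are hypothetical, and the looser prefix will also match inside longer words that happen to contain "www.".

```csharp
using System;
using System.Text.RegularExpressions;

public static class LooseUrlFinder
{
    // Variant of the answer's pattern: the prefix is either a protocol
    // followed by "://" or a literal "www." with no protocol at all.
    private static readonly Regex LooseUrlPattern = new Regex(
        @"((https?|ftp|file)\://|www\.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Returns the first match, or an empty string when there is none.
    public static string FindFirst(string text)
    {
        Match match = LooseUrlPattern.Match(text);
        return match.Success ? match.Value : string.Empty;
    }
}
```

For the sentence above, the match starts at www.google.com. It still includes the trailing period, so the end-of-URL problem is unchanged.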

It's very difficult to get every case without some better-defined boundaries.