Regular expression to match a generic URL

2023-09-04 01:25:14 Author: ⒈個人就好

I've looked all over and have yet to find a single solution to address my need for a regular expression pattern that will match a generic URL. I need to support multiple protocols (with verification), localhost and/or IP addressing, ports and query strings. Some examples:

http://localhost/mysite
https://localhost:55000
ftp://192.1.1.1
telnet://somesite/page.htm?a=1&b=2

Ideally, I'd like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.) but this is not a requirement.

(Also, for the purposes of myself and future readers, if you could explain the pattern, it would be helpful.)

Recommended answer

Nicholas Carey is correct to steer you towards RFC-3986. The regex he points out will match a generic URI, but it will not validate it (and this regex is not good for picking URLs out of "the wild" - it is too loose and matches just about any string including an empty string).
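
To see how loose that generic pattern is, here is a minimal sketch using the component-splitting regex printed in RFC 3986, Appendix B. The helper name `split_uri` and the constant name are made up for this illustration.

```php
<?php
// Component-splitting regex from RFC 3986, Appendix B.
// It splits a URI reference into parts; it does not validate anything.
const RFC3986_SPLIT = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

// Hypothetical helper: returns the five components, or null on no match.
function split_uri(string $uri): ?array
{
    if (!preg_match(RFC3986_SPLIT, $uri, $m)) {
        return null;
    }
    // Trailing groups that did not take part in the match are absent from $m.
    return [
        'scheme'    => $m[2] ?? '',
        'authority' => $m[4] ?? '',
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? '',
        'fragment'  => $m[9] ?? '',
    ];
}

$parts = split_uri('telnet://somesite/page.htm?a=1&b=2');
// scheme "telnet", authority "somesite", path "/page.htm", query "a=1&b=2"

// The same pattern also accepts strings that are clearly not URLs.
var_dump(split_uri('') !== null);           // bool(true)
var_dump(split_uri('not a url') !== null);  // bool(true)
```

Because every group in the pattern is optional, it is useful for taking a URI apart but cannot tell a good URI from a bad one.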

Regarding the validation requirement, you may want to take a look at an article I wrote on the subject, which takes from Appendix A all the ABNF syntax definitions of all the various components and provides regex equivalents:

Regular Expression URI Validation

Regarding picking URLs out of "the wild", take a look at Jeff Atwood's "The Problem With URLs" and John Gruber's "An Improved Liberal, Accurate Regex Pattern for Matching URLs" blog posts for a glimpse of some of the subtle problems that can arise. Also, you may want to take a look at a project I started last year: URL Linkification. It picks out unlinked HTTP and FTP URLs from text which may already contain some links.

That said, the following is a PHP function which uses a slightly modified version of the RFC-3986 "Absolute URI" regex to validate HTTP and FTP URLs (with this regex, the named host portion must not be empty). All the components of the URI are isolated and captured into named groups, which allows for easy manipulation and validation of the parts within the program code:

function url_valid($url)
{
    if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
    if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
    if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
        /xD', $url, $m)) return FALSE;
    switch ($m['scheme'])
    {
    case 'https':
    case 'http':
        if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
        break;
    case 'ftps':
    case 'ftp':
        break;
    default:
        return FALSE;   // Unrecognised URI scheme. Default to FALSE.
    }
    // Validate host name conforms to DNS "dot-separated-parts".
    if ($m['regname']) // If host regname specified, check for DNS conformance.
    {
        if (!preg_match('/# HTTP DNS host name.
            ^                      # Anchor to beginning of string.
            (?!.{256})             # Overall host length is less than 256 chars.
            (?:                    # Group dot separated host part alternatives.
              [0-9A-Za-z]\.        # Either a single alphanum followed by dot
            |                      # or... part has more than one char (63 chars max).
              [0-9A-Za-z]          # Part first char is alphanum (no dash).
              [\-0-9A-Za-z]{0,61}  # Internal chars are alphanum plus dash.
              [0-9A-Za-z]          # Part last char is alphanum (no dash).
              \.                   # Each part followed by literal dot.
            )*                     # Zero or more parts before top level domain.
            (?:                    # Explicitly specify top level domains.
              com|edu|gov|int|mil|net|org|biz|
              info|name|pro|aero|coop|museum|
              asia|cat|jobs|mobi|tel|travel|
              [A-Za-z]{2})         # Country codes are exactly two alpha chars.
            $                      # Anchor to end of string.
            /ix', $m['host'])) return FALSE;
    }
    $m['url'] = $url;
    for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
    return $m; // return TRUE == array of useful named $matches plus the valid $url.
}

The first regex validates the string as an absolute generic URI (one with a non-empty host portion). A second regex is used to validate the (named) host portion, when it is not an IP literal or IPv4 address, against DNS host-name rules: each dot-separated label is 63 chars or less and consists of digits, letters and dashes, the overall length is at most 255 chars, and the name must end in one of the listed top-level domains or a two-letter country code.
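
The host-name rules can also be spelled out step by step. This is a rough sketch, not the function's actual second regex: `dns_host_ok` is a hypothetical name, and it deliberately skips the top-level-domain check.

```php
<?php
// Hypothetical helper mirroring the DNS checks described above:
// overall length at most 255 chars, and each dot-separated label
// 1 to 63 chars of letters, digits and dashes, with no dash at either end.
// Unlike the regex in url_valid(), it does not check the top-level domain.
function dns_host_ok(string $host): bool
{
    if ($host === '' || strlen($host) > 255) {
        return false;
    }
    foreach (explode('.', $host) as $label) {
        // One alphanumeric, optionally followed by up to 61 alphanumerics
        // or dashes and a closing alphanumeric (63 chars max in total).
        if (!preg_match('/^[0-9A-Za-z](?:[0-9A-Za-z-]{0,61}[0-9A-Za-z])?$/D', $label)) {
            return false;
        }
    }
    return true;
}

var_dump(dns_host_ok('www.example.com'));             // bool(true)
var_dump(dns_host_ok('-bad.example.com'));            // bool(false)
var_dump(dns_host_ok(str_repeat('a', 64) . '.com'));  // bool(false)
```

Splitting on dots first keeps the per-label pattern short, at the cost of a loop that the single-regex version avoids.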

Note that the structure of this function allows easy expansion to include other schemes.
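
One way to picture that expansion, as a sketch only: the per-scheme rules from the `switch` could live in a lookup table, so that adding a scheme means adding one entry. The table, the `scheme_ok` name, and the `telnet` entry are assumptions made for illustration; they are not part of the function above.

```php
<?php
// Hypothetical per-scheme rule table. The http/https/ftp/ftps entries copy
// the switch statement in url_valid(); "telnet" is an illustrative addition.
const SCHEME_RULES = [
    'http'   => ['allow_userinfo' => false],
    'https'  => ['allow_userinfo' => false],
    'ftp'    => ['allow_userinfo' => true],
    'ftps'   => ['allow_userinfo' => true],
    'telnet' => ['allow_userinfo' => true],  // assumption, not from the original
];

// $m is assumed to be the named-group array produced by the first regex.
function scheme_ok(array $m): bool
{
    $scheme = $m['scheme'] ?? '';
    if (!array_key_exists($scheme, SCHEME_RULES)) {
        return false;  // Unrecognised scheme.
    }
    $hasUserinfo = ($m['userinfo'] ?? '') !== '';
    return SCHEME_RULES[$scheme]['allow_userinfo'] || !$hasUserinfo;
}

var_dump(scheme_ok(['scheme' => 'http', 'userinfo' => 'bob']));  // bool(false)
var_dump(scheme_ok(['scheme' => 'telnet', 'userinfo' => '']));   // bool(true)
var_dump(scheme_ok(['scheme' => 'gopher', 'userinfo' => '']));   // bool(false)
```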