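Mimvp Navigation (米扑导航) and Mimvp Domain (米扑域名) need to extract the root domain from a URL. This post summarizes and shares the methods we tried.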

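To find the domain in a URL, the first idea is a regex match; the next step is to look for libraries and existing code that implement it.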

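Regex-based parsing has many gaps. For example, the URL itself may contain another domain (in its path or query string), and the set of domain suffixes keeps growing.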
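Both gaps can be seen in a small sketch. The pattern `NAIVE` and the sample URL below are made up for illustration and are not part of the original code: the pattern knows only three suffixes and is run against the whole URL string rather than against the host.

```python
import re
from urllib.parse import urlparse

# Hypothetical naive pattern: a label followed by one of a few known suffixes
NAIVE = re.compile(r'[a-z0-9-]+\.(?:com|net|org)\b', re.IGNORECASE)

# Hypothetical URL: the host ends in .cn, and the query string carries another domain
url = "http://shop.mimvp.cn/redirect?to=example.com"

# Searching the whole URL skips the real host (.cn is not in the list)
# and picks up the domain inside the query string instead
naive_hit = NAIVE.search(url).group()

# urlparse isolates the host first, so the query string cannot interfere
host = urlparse(url).netloc

print(naive_hit)
print(host)
```

This is why the methods below first isolate the host with a URL parser and only then apply a suffix pattern to it.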

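The methods found online fall into two kinds: one combines Python's built-in modules with a regex to parse out the domain; the other uses a ready-made third-party parsing module to extract the domain directly.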

 

Method 1: Regex matching (recommended)

1. Wrap it in a class method

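import re

try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3

class YGCheckICP(object):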
    
    topRootDomain = (
                    '.com','.la','.io','.co','.info','.net','.org','.me','.mobi',
                    '.us','.biz','.xxx','.ca','.co.jp','.com.cn','.net.cn',
                    '.org.cn','.mx','.tv','.ws','.ag','.com.ag','.net.ag',
                    '.org.ag','.am','.asia','.at','.be','.com.br','.net.br',
                    '.bz','.com.bz','.net.bz','.cc','.com.co','.net.co',
                    '.nom.co','.de','.es','.com.es','.nom.es','.org.es',
                    '.eu','.fm','.fr','.gs','.in','.co.in','.firm.in','.gen.in',
                    '.ind.in','.net.in','.org.in','.it','.jobs','.jp','.ms',
                    '.com.mx','.nl','.nu','.co.nz','.net.nz','.org.nz',
                    '.se','.tc','.tk','.tw','.com.tw','.idv.tw','.org.tw',
                    '.hk','.co.uk','.me.uk','.org.uk','.vg', ".com.hk")
    
    @classmethod
    def get_domain_root(cls, url):
        domain_root = ""
        try:
            # If the url does not start with http (or https), prepend a scheme so urlparse can find the host
            if not url.startswith("http"):
                url = "http://" + url
                
            reg = r'[^\.]+('+'|'.join([h.replace('.',r'\.') for h in YGCheckICP.topRootDomain])+')$'
            pattern = re.compile(reg, re.IGNORECASE)
            
            parts = urlparse(url)
            host = parts.netloc
            m = pattern.search(host)
            res =  m.group() if m else host
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root() -- error_msg: " + str(ex))
        return domain_root

 

2. Main function

if __name__ == '__main__':
    urls = [
            "mimvp.com",
            "shop.mimvp.com",
            "pay.shop.mimvp.com",
            "http://mimvp.com/index.html",
            "https://mimvp.com/index.html",
            "https://shop.mimvp.com/index.html",
            "https://shop.mimvp.net",
            "https://shop.mimvp.cn",
            "https://shop.mimvp.mobi",
            "https://shop.mimvp.com.cn",
            "file:///D:/icp/api/index.html",
            "http://api.mimvp.org",
            "http://127.0.0.1:8000"
          ]
     
    for url in urls:
        print(YGCheckICP.get_domain_root(url))

 

3. Output

mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.net
shop.mimvp.cn
mimvp.mobi
mimvp.com.cn
file:
mimvp.org
127.0.0.1:8000

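Note on the results: file:// is handled badly. Because that url does not start with http, "http://" is prepended, and urlparse then reports "file:" as the host instead of filtering it out. The method is therefore improved as follows:

@classmethod
def get_domain_root(cls, url):
    domain_root = ""
    try:
        # Prepend a scheme only when the url has none, so urlparse can find the host
        if "://" not in url:
            url = "http://" + url
            
        reg = r'[^\.]+('+'|'.join([h.replace('.',r'\.') for h in YGCheckICP.topRootDomain])+')$'
        pattern = re.compile(reg, re.IGNORECASE)
        
        parts = urlparse(url)
        host = parts.netloc
        m = pattern.search(host)
        res =  m.group() if m else host
        domain_root = "-" if not res else res
    except Exception as ex:
        print("get_domain_root() -- error_msg: " + str(ex))
    return domain_root

The condition is changed to:

if "://" not in url:

That is, "http://" is added only when the url contains no "://". URLs that already carry a scheme such as file:// or ftp:// are left untouched, so file:///D:/icp/api/index.html now yields "-" instead of "file:". Testing for "://" alone also covers bare hosts whose name merely begins with "http".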
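The improved method can be condensed into a Python 3 sketch. It makes a few assumptions of its own: `SUFFIXES` is an abbreviated stand-in for the full `topRootDomain` tuple, the names `ROOT_RE` and `root_domain` are hypothetical, the pattern is compiled once rather than on every call, and `hostname` is used instead of `netloc`, so a port is dropped (unlike the listing above, which keeps it).

```python
import re
from urllib.parse import urlparse

# Abbreviated, hypothetical suffix list; the full topRootDomain tuple would go here
SUFFIXES = ('.com', '.net', '.org', '.mobi', '.com.cn', '.co.uk')

# Compile once at import time instead of on every call
ROOT_RE = re.compile(
    r'[^.]+(?:' + '|'.join(re.escape(s) for s in SUFFIXES) + r')$',
    re.IGNORECASE,
)

def root_domain(url):
    """Return the root domain, the bare host if no suffix matches, or "-"."""
    # Prepend a scheme only when the url has none
    if "://" not in url:
        url = "http://" + url
    # hostname is None when there is no host; it also strips any port
    host = urlparse(url).hostname or ""
    m = ROOT_RE.search(host)
    return (m.group() if m else host) or "-"

for u in ("pay.shop.mimvp.com", "https://shop.mimvp.com.cn",
          "file:///D:/icp/api/index.html", "http://127.0.0.1:8000"):
    print(root_domain(u))
```

Because the regex is anchored with `$` and searched left to right, the leftmost label that is followed by a complete known suffix wins, which is why `.com.cn` is matched as a whole rather than as `.cn` alone.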

 

 

Method 2: Parsing with urllib (not recommended)

1. Wrap it in a class method

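try:
    from urllib import splittype, splithost          # Python 2
except ImportError:
    from urllib.parse import splittype, splithost    # Python 3 (deprecated since 3.8)

class YGCheckICP(object):
    
    @classmethod
    def get_domain_root2(cls, url):
        domain_root = ""
        try:
            # splittype() strips the scheme; splithost() then splits "//host[:port]" from the path
            proto, rest = splittype(url)
            res, rest = splithost(rest)
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root2() -- error_msg: " + str(ex))
        return domain_root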

 

2. Main function

if __name__ == '__main__':
    urls = [
            "mimvp.com",
            "shop.mimvp.com",
            "pay.shop.mimvp.com",
            "http://mimvp.com/index.html",
            "https://mimvp.com/index.html",
            "https://shop.mimvp.com/index.html",
            "https://shop.mimvp.net",
            "https://shop.mimvp.cn",
            "https://shop.mimvp.mobi",
            "https://shop.mimvp.com.cn",
            "file:///D:/icp/api/index.html",
            "http://api.mimvp.org",
            "http://127.0.0.1:8000"
          ]
     
    for url in urls:
        print(YGCheckICP.get_domain_root2(url))

 

3. Output

-
-
-
mimvp.com
mimvp.com
shop.mimvp.com
shop.mimvp.net
shop.mimvp.cn
shop.mimvp.mobi
shop.mimvp.com.cn
-
api.mimvp.org
127.0.0.1:8000

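Note on the results: this does not extract the root domain. Inputs without a scheme (mimvp.com and its second- and third-level subdomains) yield "-", and for the rest only the scheme and path are stripped, so subdomains such as shop.mimvp.com come back unchanged. This does not meet our requirement.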
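On Python 3 the same host split can be sketched with `urllib.parse.urlparse`, which avoids the deprecated split helpers. The function name `host_only` is hypothetical; for the test URLs above it should return the same values as the method 2 output, which also shows that the limitation lies in the approach rather than in the particular urllib call.

```python
from urllib.parse import urlparse

def host_only(url):
    """Return the host[:port] part of a URL, or "-" when urlparse finds none."""
    # Without a scheme, urlparse puts everything into .path and leaves .netloc empty
    return urlparse(url).netloc or "-"

for u in ("mimvp.com", "https://shop.mimvp.com/index.html",
          "file:///D:/icp/api/index.html", "http://127.0.0.1:8000"):
    print(host_only(u))
```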

 

 

Method 3: Third-party library (acceptable, but not recommended, since it adds a third-party dependency)

1. Install tld

pip install tld
  Downloading tld-0.7.9-py2.py3-none-any.whl (154kB)
    100% |████████████████████████████████| 163kB 20kB/s 

 

2. Call the library

class YGCheckICP(object):
    @classmethod
    def get_domain_root3(cls, url):
        domain_root = ""
        try:
            from tld import get_tld
            
            # Prepend a scheme when the url has none; get_tld expects a full URL
            if "://" not in url:
                url = "http://" + url
                
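            # With tld 0.7.9 (installed above) this returns the registered domain, e.g. "mimvp.com";
            # later tld releases changed get_tld() to return only the suffix and offer get_fld() for this
            domain_root = get_tld(url)
        except Exception: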
            domain_root = "-"
        return domain_root

 

3. Main function

if __name__ == '__main__':
    urls = [
            "mimvp.com",
            "shop.mimvp.com",
            "pay.shop.mimvp.com",
            "http://mimvp.com/index.html",
            "https://mimvp.com/index.html",
            "https://shop.mimvp.com/index.html",
            "https://shop.mimvp.net",
            "https://shop.mimvp.cn",
            "https://shop.mimvp.mobi",
            "https://shop.mimvp.com.cn",
            "file:///D:/icp/api/index.html",
            "http://api.mimvp.org",
            "http://127.0.0.1:8000"
          ]
     
    for url in urls:
        print(YGCheckICP.get_domain_root3(url))

 

4. Output

mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.net
mimvp.cn
mimvp.mobi
mimvp.com.cn
-
mimvp.org
-

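Note on the results: the url must start with http:// or https://, otherwise get_tld raises an exception; catching the exception is enough. This is why the file:// URL yields "-". The IP address yields "-" as well, since it has no domain suffix.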

 

 

Other parsing modules that can be used:

tld
tldextract
publicsuffix
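These modules are built around a list of known public suffixes rather than a hand-written regex. The idea can be sketched with the standard library alone, assuming a tiny made-up sample set; `SAMPLE_SUFFIXES` and `registered_domain` are hypothetical names, not the API of any of the modules above, and the sketch ignores the wildcard and exception rules that a complete suffix list contains.

```python
from urllib.parse import urlparse

# Tiny hypothetical sample; real libraries load a complete suffix list
SAMPLE_SUFFIXES = {"com", "net", "org", "cn", "com.cn", "uk", "co.uk"}

def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one label to its left, or None."""
    if "://" not in url:
        url = "http://" + url
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Try the longest candidate suffix first: drop label 0, then label 1, ...
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None

for u in ("pay.shop.mimvp.com.cn", "https://shop.mimvp.cn", "http://127.0.0.1:8000"):
    print(registered_domain(u))
```

Trying the longest candidate first is what keeps `mimvp.com.cn` from being cut down to `com.cn`, and a set lookup stays simple as the suffix list grows, whereas a regex alternation has to be rebuilt.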