Getting the Root Domain of a URL in Python

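Mimvp Navigation (米扑导航) and Mimvp Domain (米扑域名) both need to extract the root domain from a URL. This post summarizes the approaches we tried.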

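To pull the domain out of a URL, the first thing that comes to mind is a regular expression, followed by a search for existing libraries and code that already do the job.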

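Regex-only parsing has many gaps. For example, the URL may contain another domain name in its path or query string, and the set of domain suffixes keeps growing.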
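To make the first gap concrete, here is a minimal Python 3 sketch. The pattern NAIVE and the helper naive_root are hypothetical names invented for this illustration, not code from the methods in this post. It applies a regex to the whole URL string and picks up a domain embedded in the query string, whereas parsing the URL first isolates the real host.

```python
import re
from urllib.parse import urlparse

# Hypothetical naive pattern: any "name.com / name.net / name.org" token in the string.
NAIVE = re.compile(r'[^./:?=&]+\.(?:com|net|org)\b', re.IGNORECASE)

def naive_root(url):
    """Return the last domain-looking token found anywhere in the URL string."""
    matches = NAIVE.findall(url)
    return matches[-1] if matches else ""

url = "http://shop.mimvp.com/go?to=http://example.org/a"
print(naive_root(url))        # example.org (taken from the query string, not the host)
print(urlparse(url).netloc)   # shop.mimvp.com
```

This is why the methods in this post run urlparse first and only apply the regex to the host part.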

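A search online turns up two kinds of approaches: one combines Python's built-in modules with a regular expression to parse out the domain, and the other uses a ready-made third-party parsing module to extract the domain directly.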

Method 1: regex matching (recommended)

1. Wrap it in a class method

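import re
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

class YGCheckICP(object):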
    
    topRootDomain = (
                    '.com','.la','.io','.co','.info','.net','.org','.me','.mobi',
                    '.us','.biz','.xxx','.ca','.co.jp','.com.cn','.net.cn',
                    '.org.cn','.mx','.tv','.ws','.ag','.com.ag','.net.ag',
                    '.org.ag','.am','.asia','.at','.be','.com.br','.net.br',
                    '.bz','.com.bz','.net.bz','.cc','.com.co','.net.co',
                    '.nom.co','.de','.es','.com.es','.nom.es','.org.es',
                    '.eu','.fm','.fr','.gs','.in','.co.in','.firm.in','.gen.in',
                    '.ind.in','.net.in','.org.in','.it','.jobs','.jp','.ms',
                    '.com.mx','.nl','.nu','.co.nz','.net.nz','.org.nz',
                    '.se','.tc','.tk','.tw','.com.tw','.idv.tw','.org.tw',
                    '.hk','.co.uk','.me.uk','.org.uk','.vg', ".com.hk")
    
    @classmethod
    def get_domain_root(cls, url):
        domain_root = ""
        try:
            ## If the URL does not start with http or https, prepend a scheme so the host can be parsed
            if url[0:4] != "http" and url[0:5] != "https" :
                url = "http://" + url
                
            reg = r'[^\.]+('+'|'.join([h.replace('.',r'\.') for h in YGCheckICP.topRootDomain])+')$'
            pattern = re.compile(reg, re.IGNORECASE)
            
            parts = urlparse(url)
            host = parts.netloc
            m = pattern.search(host)
            res =  m.group() if m else host
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root() -- error_msg: " + str(ex))
        return domain_root

2. Main function

if __name__ == '__main__':
    urls = [
            "mimvp.com",
            "shop.mimvp.com",
            "pay.shop.mimvp.com",
            "http://mimvp.com/index.html",
            "https://mimvp.com/index.html",
            "https://shop.mimvp.com/index.html",
            "https://shop.mimvp.net",
            "https://shop.mimvp.cn",
            "https://shop.mimvp.mobi",
            "https://shop.mimvp.com.cn",
            "file:///D:/icp/api/index.html",
            "http://api.mimvp.org",
            "http://127.0.0.1:8000"
          ]
     
    for url in urls:
        print(YGCheckICP.get_domain_root(url))

3. Output

mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.net
shop.mimvp.cn
mimvp.mobi
mimvp.com.cn
file:
mimvp.org
127.0.0.1:8000
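The pattern that produces this output can be studied in isolation. The following Python 3 sketch rebuilds it from a small illustrative subset of the suffix tuple. SUFFIXES, PATTERN and match_root are names made up for this example, and re.escape stands in for the manual replace('.', r'\.').

```python
import re

# Small illustrative subset of the suffix tuple, not the full list from the class.
SUFFIXES = ('.com', '.net', '.org', '.co', '.com.cn')

# One label without dots, followed by a listed suffix, anchored at the end of the host.
PATTERN = re.compile(
    r'[^.]+(' + '|'.join(re.escape(s) for s in SUFFIXES) + r')$',
    re.IGNORECASE,
)

def match_root(host):
    """Return the matched root domain, or None when no listed suffix ends the host."""
    m = PATTERN.search(host)
    return m.group() if m else None

print(match_root("pay.shop.mimvp.com"))   # mimvp.com
print(match_root("shop.mimvp.com.cn"))    # mimvp.com.cn
print(match_root("shop.mimvp.cn"))        # None: '.cn' is not in SUFFIXES
print(match_root("mimvp.com:8080"))       # None: the port sits between the suffix and '$'
```

Because the pattern is anchored with '$', a host with an unlisted suffix or a trailing port produces no match, and get_domain_root then falls back to returning the host unchanged.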

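Notes on the results: shop.mimvp.cn comes back unchanged because '.cn' is not in topRootDomain. More importantly, file:// URLs are not handled properly and are not filtered out (the output is "file:"), so the method is improved as follows: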

@classmethod
def get_domain_root(cls, url):
    domain_root = ""
    try:
        ## If the URL contains no "://" and does not start with http or https, prepend a scheme so the host can be parsed
        if len(url.split("://")) <= 1 and url[0:4] != "http" and url[0:5] != "https":
            url = "http://" + url
            
        reg = r'[^\.]+('+'|'.join([h.replace('.',r'\.') for h in YGCheckICP.topRootDomain])+')$'
        pattern = re.compile(reg, re.IGNORECASE)
        
        parts = urlparse(url)
        host = parts.netloc
        m = pattern.search(host)
        res =  m.group() if m else host
        domain_root = "-" if not res else res
    except Exception as ex:
        print("get_domain_root() -- error_msg: " + str(ex))
    return domain_root

The condition is changed to:

if len(url.split("://")) <= 1 and url[0:4] != "http" and url[0:5] != "https"

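In other words, http:// is prepended only when the URL does not already contain "://". This takes care of URLs with other schemes such as file:// or ftp://, which are now left as they are.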
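The effect of this check can be seen with a small Python 3 sketch. The helper ensure_scheme is a hypothetical name, and it uses the simpler test `"://" in url`, which is an assumption of this example rather than the exact condition above.

```python
from urllib.parse import urlparse

def ensure_scheme(url, default="http"):
    """Prefix url with a default scheme only when it contains no '://' at all."""
    return url if "://" in url else default + "://" + url

samples = (
    "shop.mimvp.com",
    "https://mimvp.com/index.html",
    "file:///D:/icp/api/index.html",
)
for raw in samples:
    fixed = ensure_scheme(raw)
    # netloc is empty for the file:// URL, so the caller can map it to "-".
    print(fixed, "->", repr(urlparse(fixed).netloc))
```

With the scheme left alone, urlparse reports an empty host for the file:// URL instead of the misleading "file:".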

Method 2: parsing with urllib (not recommended)

1. Wrap it in a class method

import urllib   # Python 2 module; Python 3 only has deprecated equivalents in urllib.parse

class YGCheckICP(object):
    
    @classmethod
    def get_domain_root2(cls, url):
        domain_root = ""
        try:
            proto, rest = urllib.splittype(url)
            res, rest = urllib.splithost(rest)
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root2() -- error_msg: " + str(ex))
        return domain_root

2. Main function

if __name__ == '__main__':
    urls = [
            "mimvp.com",
            "shop.mimvp.com",
            "pay.shop.mimvp.com",
            "http://mimvp.com/index.html",
            "https://mimvp.com/index.html",
            "https://shop.mimvp.com/index.html",
            "https://shop.mimvp.net",
            "https://shop.mimvp.cn",
            "https://shop.mimvp.mobi",
            "https://shop.mimvp.com.cn",
            "file:///D:/icp/api/index.html",
            "http://api.mimvp.org",
            "http://127.0.0.1:8000"
          ]
     
    for url in urls:
        print(YGCheckICP.get_domain_root2(url))

3. Output

-
-
-
mimvp.com
mimvp.com
shop.mimvp.com
shop.mimvp.net
shop.mimvp.cn
shop.mimvp.mobi
shop.mimvp.com.cn
-
api.mimvp.org
127.0.0.1:8000

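Notes on the results: whether the input is already a root domain (mimvp.com) or a second- or third-level domain, this method never returns the root domain. Inputs without a scheme yield "-", and for the rest it merely strips the http:// or https:// prefix and the path, so it does not meet our requirement.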
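For readers on Python 3, where urllib.splittype and urllib.splithost are not available under those names, roughly the same behavior can be sketched with urlparse. get_host is a hypothetical helper written for this illustration, not part of the class above.

```python
from urllib.parse import urlparse

def get_host(url):
    """Return the host (and port, if any) of url, or '-' when urlparse finds none."""
    return urlparse(url).netloc or "-"

print(get_host("mimvp.com"))                          # -
print(get_host("https://shop.mimvp.com/index.html"))  # shop.mimvp.com
print(get_host("http://127.0.0.1:8000"))              # 127.0.0.1:8000
```

The limitation is the same: the result is the full host, not the root domain, so a suffix-aware step is still required.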

Method 3: a third-party library (acceptable but not recommended, since it adds a third-party dependency)

1. Install tld

pip install tld
Downloading tld-0.7.9-py2.py3-none-any.whl (154kB)
100% |████████████████████████████████| 163kB 20kB/s

2. Call the library

class YGCheckICP(object):
    @classmethod
    def get_domain_root3(cls, url):
        domain_root = ""
        try:
            from tld import get_tld
            
            ## If the URL contains no "://" and does not start with http or https, prepend a scheme so tld can parse it
            if len(url.split("://")) <= 1 and url[0:4] != "http" and url[0:5] != "https" :
                url = "http://" + url
                
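            # With tld 0.7.9 (installed above) get_tld returns the root domain;
            # newer tld releases provide get_fld for that purpose.
            domain_root = get_tld(url)
        except Exception:
            domain_root = "-"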
        return domain_root

3. Main function

if __name__ == '__main__':
    urls = [
            "mimvp.com",
            "shop.mimvp.com",
            "pay.shop.mimvp.com",
            "http://mimvp.com/index.html",
            "https://mimvp.com/index.html",
            "https://shop.mimvp.com/index.html",
            "https://shop.mimvp.net",
            "https://shop.mimvp.cn",
            "https://shop.mimvp.mobi",
            "https://shop.mimvp.com.cn",
            "file:///D:/icp/api/index.html",
            "http://api.mimvp.org",
            "http://127.0.0.1:8000"
          ]
     
    for url in urls:
        print(YGCheckICP.get_domain_root3(url))

4. Output

mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.net
mimvp.cn
mimvp.mobi
mimvp.com.cn
-
mimvp.org
-

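Notes on the results: the URL must start with http:// or https://, otherwise get_tld raises an exception, and catching the exception is enough. Inputs that cannot be resolved to a domain, such as the file:// URL and the IP address above, end up as "-".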
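Instead of relying only on the exception, the input can also be checked up front. The following Python 3 sketch uses only the standard library. is_domain_url is a hypothetical helper, and treating IP literals as "not a domain" is an assumption that mirrors the "-" results above.

```python
import ipaddress
from urllib.parse import urlparse

def is_domain_url(url):
    """Return True if url is http(s) and its host is a name rather than an IP literal."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        ipaddress.ip_address(parts.hostname)
        return False          # the host is a literal IPv4 or IPv6 address
    except ValueError:
        return True           # not an IP address, so treat it as a domain name

print(is_domain_url("https://shop.mimvp.cn"))           # True
print(is_domain_url("file:///D:/icp/api/index.html"))   # False
print(is_domain_url("http://127.0.0.1:8000"))           # False
```

A caller could return "-" immediately when this check fails and only pass the remaining URLs to the library.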

Other parsing modules that can be used:

tld
tldextract
publicsuffix
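As far as I know, libraries of this kind rely on a maintained suffix list such as the Public Suffix List rather than on a hand-written regex. The idea can be sketched in a few lines of Python 3. The SUFFIXES set below is a tiny hand-picked sample and registered_domain is a hypothetical helper; neither is the actual data or API of any of these packages.

```python
# Tiny hand-picked suffix set for illustration only; a real library would load
# a full, regularly updated suffix list instead of hard-coding entries.
SUFFIXES = {"com", "net", "org", "cn", "com.cn", "co.uk"}

def registered_domain(host):
    """Return the longest known suffix plus one more label, or None if none matches."""
    labels = host.lower().split(".")
    # Start with the longest candidate suffix so that 'com.cn' wins over 'cn'.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[i - 1:])
    return None

print(registered_domain("pay.shop.mimvp.com"))   # mimvp.com
print(registered_domain("shop.mimvp.com.cn"))    # mimvp.com.cn
print(registered_domain("shop.mimvp.cn"))        # mimvp.cn
print(registered_domain("127.0.0.1"))            # None
```

Compared with Method 1, supporting a new suffix only means adding an entry to the set, with no regex to rebuild.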
