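Getting the Root Domain of a URL in Python
米扑导航 and 米扑域名 both need to extract the root domain from a URL, so this post summarizes the methods we tried.
The first idea for pulling a domain out of a URL is a regular expression, followed by a search for existing libraries and code.
Regex matching alone is incomplete in several ways: for example, a URL can contain other domain names besides the host (for instance in its path or query string), and the set of domain suffixes keeps growing.
Two approaches turn up online: one combines Python's built-in modules with a regex, and the other uses a ready-made third-party parsing module to extract the domain directly.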
Method 1: Regex matching (recommended)
1. Wrapper class method
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


class YGCheckICP(object):
    # Known domain suffixes; a host is only reduced if it ends with one of these
    topRootDomain = (
        '.com','.la','.io','.co','.info','.net','.org','.me','.mobi',
        '.us','.biz','.xxx','.ca','.co.jp','.com.cn','.net.cn',
        '.org.cn','.mx','.tv','.ws','.ag','.com.ag','.net.ag',
        '.org.ag','.am','.asia','.at','.be','.com.br','.net.br',
        '.bz','.com.bz','.net.bz','.cc','.com.co','.net.co',
        '.nom.co','.de','.es','.com.es','.nom.es','.org.es',
        '.eu','.fm','.fr','.gs','.in','.co.in','.firm.in','.gen.in',
        '.ind.in','.net.in','.org.in','.it','.jobs','.jp','.ms',
        '.com.mx','.nl','.nu','.co.nz','.net.nz','.org.nz',
        '.se','.tc','.tk','.tw','.com.tw','.idv.tw','.org.tw',
        '.hk','.co.uk','.me.uk','.org.uk','.vg', ".com.hk")

    @classmethod
    def get_domain_root(cls, url):
        domain_root = ""
        try:
            # If the URL does not start with http or https, prepend http://
            # so that urlparse can extract the host for the regex
            if url[0:4] != "http" and url[0:5] != "https":
                url = "http://" + url

            # One label (no dots) followed by a known suffix at the end of the host
            reg = r'[^\.]+(' + '|'.join([h.replace('.', r'\.') for h in cls.topRootDomain]) + ')$'
            pattern = re.compile(reg, re.IGNORECASE)

            parts = urlparse(url)
            host = parts.netloc
            m = pattern.search(host)
            # Fall back to the whole host when no known suffix matches
            res = m.group() if m else host
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root() -- error_msg: " + str(ex))
        return domain_root
2. Main function
if __name__ == '__main__':
    urls = [
        "mimvp.com",
        "shop.mimvp.com",
        "pay.shop.mimvp.com",
        "http://mimvp.com/index.html",
        "https://mimvp.com/index.html",
        "https://shop.mimvp.com/index.html",
        "https://shop.mimvp.net",
        "https://shop.mimvp.cn",
        "https://shop.mimvp.mobi",
        "https://shop.mimvp.com.cn",
        "file:///D:/icp/api/index.html",
        "http://api.mimvp.org",
        "http://127.0.0.1:8000"
    ]
    for url in urls:
        print(YGCheckICP.get_domain_root(url))
3. Output
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.net
shop.mimvp.cn
mimvp.mobi
mimvp.com.cn
file:
mimvp.org
127.0.0.1:8000
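Notes on the result: https://shop.mimvp.cn comes back unchanged as shop.mimvp.cn because .cn is not in the suffix list. The file:// URL is also handled badly: it is not filtered out and yields "file:". The method therefore needs the following improvement: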
    @classmethod
    def get_domain_root(cls, url):
        domain_root = ""
        try:
            # Prepend http:// only when the URL carries no scheme at all
            if len(url.split("://")) <= 1 and url[0:4] != "http" and url[0:5] != "https":
                url = "http://" + url

            # One label (no dots) followed by a known suffix at the end of the host
            reg = r'[^\.]+(' + '|'.join([h.replace('.', r'\.') for h in cls.topRootDomain]) + ')$'
            pattern = re.compile(reg, re.IGNORECASE)

            parts = urlparse(url)
            host = parts.netloc
            m = pattern.search(host)
            # Fall back to the whole host when no known suffix matches
            res = m.group() if m else host
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root() -- error_msg: " + str(ex))
        return domain_root
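The changed line is:
if len(url.split("://")) <= 1 and url[0:4] != "http" and url[0:5] != "https":
That is, http:// is prepended only when the URL does not already contain "://". URLs that carry another scheme, such as file:// or ftp://, are left untouched, so they no longer produce a bogus host such as "file:".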
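A minimal sketch of the scheme check described above, written for Python 3. The helper name `ensure_scheme` and its `default` parameter are hypothetical and not part of the original class. It tests only for "://", which also covers bare hosts whose names happen to begin with "http" (such as httpbin.org). The condition above appears to leave those without a scheme, in which case urlparse returns an empty netloc for them.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="http"):
    """Return url unchanged if it already contains '://', else prepend a scheme."""
    if "://" in url:
        return url
    return default + "://" + url


# Show how each input is normalized and what host urlparse then sees
for raw in ("mimvp.com", "httpbin.org", "file:///D:/icp/api/index.html"):
    fixed = ensure_scheme(raw)
    print(raw, "->", fixed, "| netloc:", repr(urlparse(fixed).netloc))
```

The file:// URL passes through unchanged and has an empty netloc, so the caller can map it to "-" exactly as the class method does.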
Method 2: Parsing with urllib (not recommended)
1. Wrapper class method
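try:
    from urllib.parse import splittype, splithost  # Python 3 (deprecated since 3.8)
except ImportError:
    from urllib import splittype, splithost  # Python 2


class YGCheckICP(object):
    @classmethod
    def get_domain_root2(cls, url):
        domain_root = ""
        try:
            # Split off the scheme, then split the host from the rest of the URL
            proto, rest = splittype(url)
            res, rest = splithost(rest)
            domain_root = "-" if not res else res
        except Exception as ex:
            print("get_domain_root2() -- error_msg: " + str(ex))
        return domain_root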
2. Main function
if __name__ == '__main__':
    urls = [
        "mimvp.com",
        "shop.mimvp.com",
        "pay.shop.mimvp.com",
        "http://mimvp.com/index.html",
        "https://mimvp.com/index.html",
        "https://shop.mimvp.com/index.html",
        "https://shop.mimvp.net",
        "https://shop.mimvp.cn",
        "https://shop.mimvp.mobi",
        "https://shop.mimvp.com.cn",
        "file:///D:/icp/api/index.html",
        "http://api.mimvp.org",
        "http://127.0.0.1:8000"
    ]
    for url in urls:
        print(YGCheckICP.get_domain_root2(url))
3. Output
-
-
-
mimvp.com
mimvp.com
shop.mimvp.com
shop.mimvp.net
shop.mimvp.cn
shop.mimvp.mobi
shop.mimvp.com.cn
-
api.mimvp.org
127.0.0.1:8000
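Notes on the result: bare inputs without a scheme, whether the root domain itself (mimvp.com) or its second- and third-level subdomains, produce "-" instead of a root domain. For full URLs the method merely strips the scheme (http and so on) and the path, leaving the complete host name. It does not meet our requirement.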
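The same limitation can be seen with the currently supported standard-library API. The sketch below assumes Python 3 and uses `urllib.parse.urlsplit`; the helper name `get_host` is hypothetical. It returns the host of a URL, which is still the full host name rather than the root domain.

```python
from urllib.parse import urlsplit


def get_host(url):
    """Return the lowercased host of url, or None when the URL has no network location."""
    return urlsplit(url).hostname


for url in ("https://shop.mimvp.com/index.html",
            "http://127.0.0.1:8000",
            "mimvp.com",
            "file:///D:/icp/api/index.html"):
    print(url, "->", get_host(url))
```

Note that `hostname` drops the port, so http://127.0.0.1:8000 gives 127.0.0.1. Inputs without a scheme and file:// URLs give None, and a subdomain such as shop.mimvp.com is returned as is. URL parsing alone has no knowledge of domain suffixes.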
Method 3: Using a third-party library (acceptable, but not recommended, since it adds a third-party dependency)
1. Install tld
pip install tld
Downloading tld-0.7.9-py2.py3-none-any.whl (154kB)
100% |████████████████████████████████| 163kB 20kB/s
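2. Call the library
class YGCheckICP(object):
    @classmethod
    def get_domain_root3(cls, url):
        domain_root = ""
        try:
            # Written against tld 0.7.9 (installed above), where get_tld() returns
            # the registered domain, e.g. mimvp.com. Newer tld releases are believed
            # to return only the suffix here and to offer get_fld() instead.
            from tld import get_tld

            # Prepend http:// when the URL carries no scheme, since tld requires one
            if len(url.split("://")) <= 1 and url[0:4] != "http" and url[0:5] != "https":
                url = "http://" + url

            domain_root = get_tld(url)
        except Exception:
            # tld raises for URLs it cannot resolve to a domain
            domain_root = "-"
        return domain_root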
3. Main function
if __name__ == '__main__':
    urls = [
        "mimvp.com",
        "shop.mimvp.com",
        "pay.shop.mimvp.com",
        "http://mimvp.com/index.html",
        "https://mimvp.com/index.html",
        "https://shop.mimvp.com/index.html",
        "https://shop.mimvp.net",
        "https://shop.mimvp.cn",
        "https://shop.mimvp.mobi",
        "https://shop.mimvp.com.cn",
        "file:///D:/icp/api/index.html",
        "http://api.mimvp.org",
        "http://127.0.0.1:8000"
    ]
    for url in urls:
        print(YGCheckICP.get_domain_root3(url))
4. Output
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.com
mimvp.net
mimvp.cn
mimvp.mobi
mimvp.com.cn
-
mimvp.org
-
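Notes on the result: the URL must start with http:// or https://, otherwise the library raises an exception. Catching the exception is enough to handle this.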
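The "-" for http://127.0.0.1:8000 in the output above is presumably because an IP address has no domain suffix to look up. If IP hosts should be reported as such rather than collapsed into "-", they can be detected before calling the library. A minimal Python 3 sketch using only the standard library; `is_ip_host` is a hypothetical helper name, not part of tld.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_host(url):
    """Return True when the host part of url is a literal IPv4 or IPv6 address."""
    if "://" not in url:
        url = "http://" + url  # give urlsplit a scheme so it can find the host
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


for url in ("http://127.0.0.1:8000", "https://shop.mimvp.com", "file:///D:/icp/api/index.html"):
    print(url, "->", is_ip_host(url))
```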
Other parsing modules that can be used:
tld
tldextract
publicsuffix
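Modules such as tldextract and publicsuffix work from the Public Suffix List rather than a hand-maintained regex. The core idea can be sketched in plain Python 3 as a longest-suffix match over the host's labels. The `SUFFIXES` set below is a tiny hypothetical sample and `registered_domain` is an illustrative name. The real list is far larger and also has wildcard and exception rules, which this sketch ignores.

```python
# Hypothetical sample of suffixes; the real Public Suffix List is much larger
SUFFIXES = {"com", "net", "org", "mobi", "cn", "com.cn", "co.uk"}


def registered_domain(host, suffixes=SUFFIXES):
    """Return the label just before the longest known suffix, plus that suffix.

    Returns None when the host has no known suffix or is itself a suffix.
    """
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "com.cn" is preferred over "cn"
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is a bare suffix, e.g. "com.cn"
            return ".".join(labels[i - 1:])
    return None


for host in ("pay.shop.mimvp.com", "shop.mimvp.com.cn", "shop.mimvp.cn", "127.0.0.1"):
    print(host, "->", registered_domain(host))
```

Because the suffixes live in a set rather than in the regex, supporting a new suffix only requires updating the data. Such libraries do this by shipping or downloading the list.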
Copyright: this article was originally written, reposted, excerpted, or revised and published by 米扑博客. Last updated 2018-03-25 05:50:21
Attribution when reposting: Getting the Root Domain of a URL in Python (米扑博客)