Correct usage of re.sub for regex-based string replacement in Python

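Introduction to re.sub

re is short for Regular Expression, and sub is short for substitute.

re.sub is the regular-expression function for replacement: it substitutes the text matched by a pattern, which makes it far more powerful than the plain string method replace.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

The first three parameters (pattern, repl, string) are required; the last two (count, flags) are optional.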

The simplest example

If the input string is

inputStr = "hello 111 world 111"

then

replacedStr = inputStr.replace("111", "222")

turns it into

"hello 222 world 222"

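But suppose the input string is:

inputStr = "hello 123 world 456"

and you want to turn both 123 and 456 into 222. A single call to the string method replace cannot do this, because it only matches one fixed piece of text, and more complex cases are even further out of its reach.

This is where re.sub comes in, using a regular expression to describe what should be replaced:

replacedStr = re.sub(r"\d+", "222", inputStr)

Real-world cases are often more complicated than this one, and those are exactly the cases that call for re.sub.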
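The difference between the two approaches can be checked side by side. The following is a minimal sketch; the variable names literal_result and regex_result are chosen here for illustration only.

```python
import re

input_str = "hello 123 world 456"

# str.replace only matches the literal text "123", so "456" is left alone.
literal_result = input_str.replace("123", "222")

# re.sub with r"\d+" matches every run of digits, whatever its value.
regex_result = re.sub(r"\d+", "222", input_str)

print(literal_result)  # hello 222 world 456
print(regex_result)    # hello 222 world 222
```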

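So the purpose of re.sub can be summed up as:

Given an input string, use the power of regular expressions to carry out a relatively complex replacement, and return the resulting string.

re.sub also accepts optional parameters: count limits the number of replacements, flags=re.I makes matching case-insensitive, and so on.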

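re.sub syntax in detail

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

The first three parameters (pattern, repl, string) are required; the last two (count, flags) are optional.

1. pattern: the regular expression pattern string

Inside the pattern, a backslash followed by a number (\1, \2, and so on) is a backreference to the corresponding group: that position must match the same text the group matched earlier.

For example, the code below replaces the whole string "hello python, ni hao c, zai jian python" with PHP: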

import re
inputStr = "hello python, ni hao c, zai jian python"
# \1 must repeat whatever group 1 matched ("python"), so the whole string matches
replaceStr = re.sub(r"hello (\w+), ni hao (\w+), zai jian \1", "PHP", inputStr)
print(replaceStr)  # PHP

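The \1 in the pattern stands for the text matched by the first group, "python". The pattern therefore matches the original string in full, and the whole string is replaced with PHP.

Now change it as follows (the only difference is \2 in place of \1):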

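import re
inputStr = "hello python, ni hao c, zai jian python"
# \2 must repeat whatever group 2 matched ("c"), but the string ends with "python"
replaceStr = re.sub(r"hello (\w+), ni hao (\w+), zai jian \2", "PHP", inputStr)
print(replaceStr)  # prints the original string unchanged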

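The \2 in the pattern stands for the text matched by the second group, "c". The pattern would need the string to end in "zai jian c", which it does not, so nothing matches and nothing is replaced.

The point of this comparison: \N stands for the text that the Nth group actually matched, not for the Nth group's sub-pattern.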
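The distinction between "the matched text" and "the sub-pattern" is easiest to see when the last word differs but would still satisfy \w+. The sketch below uses a second input string ending in "java", which is an assumption added for illustration and does not appear in the original examples.

```python
import re

pattern = r"hello (\w+), ni hao (\w+), zai jian \1"

# Group 1 captured "python" and the text ends with "python": the whole string matches.
same_word = re.sub(pattern, "PHP", "hello python, ni hao c, zai jian python")

# "java" would satisfy \w+, but \1 demands the exact text "python" again: no match.
other_word = re.sub(pattern, "PHP", "hello python, ni hao c, zai jian java")

print(same_word)   # PHP
print(other_word)  # hello python, ni hao c, zai jian java
```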

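2. repl: the replacement, which can be either a string or a function

If it is a string, the backslash escapes in it are processed:
\n: converted to a newline character
\r: converted to a carriage return
Unknown escapes: in Python 2, an unknown escape such as \j is left unchanged, backslash included; since Python 3.7, an unknown escape made of a backslash and an ASCII letter raises an error
\g<name>: a backslash, the letter g, and a name in angle brackets refer to the group with that name; \g<number> refers to a group by number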

For example:

import re
inputStr = "hello python, ni hao c, zai jian python"
# raw string for repl, so that \g reaches re.sub untouched
replaceStr = re.sub(r"hello (\w+), ni hao (\w+), zai jian \1", r"\g<2>", inputStr)
print(replaceStr)  # c

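Running this prints c: \g<2> in repl stands for the text matched by the second group, and because the pattern matched the entire input, the entire input is replaced by it.

Named groups work as well: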

import re
inputStr = "hello python, ni hao c, zai jian python"
# named groups are still numbered, so \1 continues to refer to word1
replaceStr = re.sub(r"hello (?P<word1>\w+), ni hao (?P<word2>\w+), zai jian \1", r"\g<word2>", inputStr)
print(replaceStr)  # c

The output is c again. Each group has been given a name, and word2 is the group that matched c.

If repl is a function, it can be used like this:

import re

def pythonSubDemo():
    inputStr = "hello 123 world 456"

    def _add111(matched):
        # matched is the match object for one occurrence of the pattern
        intStr = matched.group("number")
        intValue = int(intStr)
        addValue = intValue + 111
        addValueStr = str(addValue)
        return addValueStr

    replacedStr = re.sub(r"(?P<number>\d+)", _add111, inputStr)
    print(replacedStr)  # hello 234 world 567

if __name__ == "__main__":
    pythonSubDemo()

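Explanation of the key parts: the pattern matches each run of digits and names the group number; there are two matches in total, 123 and 456.

For each match, re.sub calls _add111 with the match object. The function reads the text of the group named number, adds 111 to it, and returns the result as the replacement string.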
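A function as repl is also handy when the replacement depends on a lookup rather than on arithmetic. The following is a minimal sketch; the replacements table and the lookup function are hypothetical names invented for this illustration.

```python
import re

# Hypothetical lookup table, invented for this illustration.
replacements = {"python": "PHP", "c": "Go"}

def lookup(match):
    # group(0) is the entire text of the current match.
    word = match.group(0)
    # Words without an entry in the table are returned unchanged.
    return replacements.get(word, word)

result = re.sub(r"\w+", lookup, "hello python, ni hao c")
print(result)  # hello PHP, ni hao Go
```

Returning the matched word itself for unknown keys keeps the rest of the string intact, so only the listed words are affected.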

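3. string: the input string to process. re.sub does not modify it; the result is returned as a new string. To strip the matched text out entirely, pass an empty string '' as repl.

4. count: the maximum number of replacements; if omitted or 0 (the default), every match is replaced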
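The effect of count can be seen on a string with three numbers. This is a small sketch; the input string is made up for the illustration.

```python
import re

input_str = "hello 123 world 456 again 789"

# count=1: only the leftmost match is replaced.
first_only = re.sub(r"\d+", "222", input_str, count=1)

# count=0 (the default): every match is replaced.
all_of_them = re.sub(r"\d+", "222", input_str, count=0)

print(first_only)   # hello 222 world 456 again 789
print(all_of_them)  # hello 222 world 222 again 222
```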

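5. flags: matching modes

Several flags can be combined with the bitwise OR operator |, and flags can also be set inside the pattern itself with inline syntax such as (?i).

re.I: ignore case

re.L: make \w, \W, \b, \B, \s, \S depend on the current locale

re.M: multi-line mode, in which ^ and $ also match at the start and end of each line

re.S: make . match any character, including a newline

re.U: make \w, \W, \b, \B, \d, \D, \s, \S follow Unicode character properties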
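Combining flags and using the inline form might look like the sketch below. The sample text is invented for the illustration; only re.I and re.M from the list above are used.

```python
import re

text = "Spam\nspam and SPAM"

# re.I alone: every casing of "spam" matches.
ignore_case = re.sub(r"spam", "eggs", text, flags=re.I)

# re.I | re.M: ^ also matches right after the newline,
# so only the word at the start of each line is replaced.
line_starts = re.sub(r"^spam", "eggs", text, flags=re.I | re.M)

# The inline form (?i) at the start of the pattern behaves like flags=re.I.
inline = re.sub(r"(?i)spam", "eggs", text)

print(repr(ignore_case))  # 'eggs\neggs and eggs'
print(repr(line_starts))  # 'eggs\neggs and SPAM'
print(inline == ignore_case)  # True
```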

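Watch the parameter order

Do not pass a flags value, which belongs in the fifth parameter, in the position of the fourth parameter count. Flags are integers (re.I is 2), so the call does not fail; it silently replaces only some of the matches and the flag never takes effect.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

Typical symptoms in Python: (1) calling re.compile and then sub on the compiled pattern works, but re.sub does not; or (2) re.search followed by replace works, but neither re.sub nor re.compile followed by sub works.

The problem:

A value passed as the fourth argument was meant to be flags but is actually taken as count, so re.sub replaces only part of the matches.

So either spell out all the parameters, or pass the one you need by keyword, for example: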

replacedStr = re.sub(replacePattern, replacementStr, originalStr, flags=re.I)  # count omitted: all matches are replaced

or

replacedStr = re.sub(replacePattern, replacementStr, originalStr, 0, re.I)  # every parameter given explicitly
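The pitfall can be reproduced in a few lines. This is a sketch with a made-up input string; it relies on re.I having the integer value 2.

```python
import re

text = "a A a A a"

# Intended: case-insensitive replace-all. Actual: re.I lands in the count slot.
# re.I equals 2, so at most two matches are replaced and matching stays case-sensitive.
wrong = re.sub(r"a", "x", text, re.I)

# Passing the flag by keyword gives the intended behaviour.
right = re.sub(r"a", "x", text, flags=re.I)

print(wrong)  # x A x A a
print(right)  # x x x x x
```

Passing flags by keyword is the safer habit, since it cannot be confused with count regardless of how many positional arguments precede it.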

Definition of re.sub in the Python documentation

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. 
If the pattern isn’t found, string is returned unchanged. 
repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; 
count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. 
Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.

Changed in version 2.7: Added the optional flags argument.
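The quoted remark about \g<2>0 versus \20 can be checked directly. The sketch below uses a made-up pattern with two groups; with only two groups defined, \20 is read as a reference to a nonexistent group 20 and re.sub raises re.error.

```python
import re

pattern = r"(\w+) (\d)"

# \g<2>0 means: the text of group 2, followed by a literal "0".
unambiguous = re.sub(pattern, r"\g<2>0 \1", "item 7")
print(unambiguous)  # 70 item

# \20 is read as group 20, which this pattern does not define.
try:
    re.sub(pattern, r"\20 \1", "item 7")
    raised = False
except re.error:
    raised = True
print(raised)  # True
```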

