
[Python Advanced Essentials] Master the re Library in One Article: Regular Expressions in Practice

jemmiexu | Blog | 2024-05-15

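Getting to know the re library

Basic usage of the re library

The compile() function

Basic usage

Common regular expression rule characters

The match and search methods

match

match/search

The findall and finditer methods

Using findall() to return all matches

Using findall() to extract matches with multiple groups

Using finditer() to return Match objects one by one

Using finditer() with more complex match structures

Advanced usage

Groups and backreferences

Replacing part of a text

Extracting and recombining subgroups

Using subgroups in search results

Greedy and lazy matching

Predefined character classes and special characters

Conclusion and discussion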

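Dear reader, have you ever run into a tricky string-processing problem while programming, or been worn down by tedious text-matching work? Today we take a close look at a powerful tool in the Python world: the re module. It is the part of the Python standard library that handles regular expressions, and it helps with all kinds of string-processing tasks.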

Getting to know the re library

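Python's re module provides full regular expression support. A regular expression is a powerful tool for matching text patterns; it handles complex string operations such as searching, replacing, and splitting concisely.

In Python, a single import re makes it available.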
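As a quick illustration of the three operations just mentioned (searching, replacing, splitting), here is a minimal sketch. The sample string and variable names are invented for this example.

```python
import re

# Sample text invented for this illustration.
text = "apple, banana;cherry  date"

# Search: collect every run of lowercase letters.
words = re.findall(r'[a-z]+', text)

# Replace: turn each run of separators (commas, semicolons, whitespace) into " | ".
replaced = re.sub(r'[,;\s]+', ' | ', text)

# Split: cut the string on the same separators.
parts = re.split(r'[,;\s]+', text)

print(words)     # ['apple', 'banana', 'cherry', 'date']
print(replaced)  # apple | banana | cherry | date
print(parts)     # ['apple', 'banana', 'cherry', 'date']
```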

Basic usage of the re library

The compile() function

First, use the re.compile() function to compile a regular expression into a Pattern object.

Basic usage

import re

# Match one or more consecutive digits
pattern = re.compile(r'\d+')

# Match an email address (a simplified pattern, not full validation)
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', re.IGNORECASE)

# Match a username made of word characters: letters, digits, and underscores (at least one character)
username_pattern = re.compile(r'\w+')

# Match an http or https URL
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Match a phone number (formats such as 123-456-7890 or (123) 456-7890)
phone_pattern = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4})')

# Match an IPv4 address
ipv4_pattern = re.compile(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')

# Match a credit card number (typically 16 digits, optionally separated by spaces or hyphens)
credit_card_pattern = re.compile(r'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}')

# Match a date in YYYY-MM-DD format
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

# Match a hex color code (such as #FF0000); ^ and $ require the whole string to be the code
color_code_pattern = re.compile(r'^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$')

# Match integers and decimals (negative, positive without a sign, and zero)
number_pattern = re.compile(r'-?\d+(\.\d+)?')
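To show how compiled patterns like these are actually used, here is a small sketch that exercises three of them. The sample strings are invented for illustration, and the patterns are repeated so the snippet runs on its own.

```python
import re

date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
color_code_pattern = re.compile(r'^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$')
number_pattern = re.compile(r'-?\d+(\.\d+)?')

# Sample inputs invented for this illustration.
log_line = "Deployed on 2024-05-15, rolled back on 2024-05-16."
dates = date_pattern.findall(log_line)

# The color pattern is anchored with ^ and $, so it validates a whole string.
is_color = color_code_pattern.match("#FF0000") is not None
not_color = color_code_pattern.match("#FF00") is not None

# number_pattern contains a capturing group, so findall() would return only
# the captured fractional part; finditer() + group(0) gives the whole match.
numbers = [m.group(0) for m in number_pattern.finditer("-3.5 and 42 and 0")]

print(dates)                # ['2024-05-15', '2024-05-16']
print(is_color, not_color)  # True False
print(numbers)              # ['-3.5', '42', '0']
```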

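Common regular expression rule characters

\d: matches any single decimal digit. For ASCII text this is equivalent to [0-9]; with str patterns in Python 3 it also matches other Unicode decimal digits unless the re.ASCII flag is set.

+: a quantifier meaning the preceding element (here \d) occurs one or more times. As a whole, \d+ matches a run of one or more digits, such as "123" or "456789".

\w: matches a word character: a letter (upper or lower case), a digit, or an underscore. With re.ASCII this is exactly [a-zA-Z0-9_]; by default, str patterns also match Unicode letters and digits.

\s: matches any whitespace character, including spaces, tabs, and newlines.

. (dot): matches any single character except a newline (unless re.DOTALL is set).

^: matches at the start of the string; inside a character class [] it negates the class (for example, [^abc] matches any character other than a, b, or c).

$: matches at the end of the string (or just before a trailing newline).

*: matches the preceding element zero or more times.

?: matches the preceding element zero or one time.

{m,n}: the preceding element occurs at least m and at most n times.

|: means "or"; matches one of several alternatives.

(): groups part of the pattern and captures the matched substring.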
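The rule characters above are easier to remember with a few one-line experiments. The following is a minimal sketch; all sample strings and variable names are invented for illustration.

```python
import re

# ^ and $ anchor the pattern; {3,5} bounds the repetition count.
short_id = re.compile(r'^\d{3,5}$')
ok = short_id.match("1234") is not None          # 4 digits: accepted
too_long = short_id.match("123456") is not None  # 6 digits: rejected

# | chooses between alternatives; (?:...) groups them without capturing.
pets = re.findall(r'(?:cat|dog)s?', "cats, dog, bird, dogs")

# [^...] negates a character class: runs of anything except comma or space.
fields = re.findall(r'[^, ]+', "a1, b2,c3")

# ? makes the preceding element optional; * allows zero or more repetitions.
spellings = re.findall(r'colou?r', "color colour")
ab_runs = re.findall(r'ab*', "a ab abbb")

print(ok, too_long)  # True False
print(pets)          # ['cats', 'dog', 'dogs']
print(fields)        # ['a1', 'b2', 'c3']
print(spellings)     # ['color', 'colour']
print(ab_runs)       # ['a', 'ab', 'abbb']
```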

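re.compile(pattern, flags=0) serves three purposes:

Precompilation: it converts the regular expression into a compiled Pattern object up front, which can speed up later matching when the same pattern is used many times. (The re module also caches recently used patterns internally, so the gain for occasional use is small.)

Reuse: once a pattern is compiled, the same object can be used in different parts of a program for matching, searching, replacing, and so on.

Flag support: flags can be passed to change the default behavior of the regular expression, such as ignoring case or enabling multiline mode.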
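The flag and reuse points can be sketched as follows. The pattern name and the sample notes are assumptions made up for this example; re.IGNORECASE and re.MULTILINE are standard flags of the re module.

```python
import re

# Compile once with two flags combined by |:
#   re.IGNORECASE makes "todo" match "TODO", "Todo", and so on;
#   re.MULTILINE lets ^ and $ match at the start and end of every line.
todo_pattern = re.compile(r'^todo:\s*(.+)$', re.IGNORECASE | re.MULTILINE)

# Sample text invented for this illustration.
notes = "TODO: write tests\nnote: nothing here\nTodo: update docs"

# Reuse the same compiled object for different operations.
tasks = todo_pattern.findall(notes)
first_task = todo_pattern.search(notes).group(1)

print(tasks)       # ['write tests', 'update docs']
print(first_task)  # write tests
```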

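The match and search methods

The pattern.match() method only checks whether the pattern matches at the start of the string, whereas pattern.search() scans the whole string and returns the first match it finds.

match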

import re

text = "2023-01-01 This is a date at the start of the string."

# match() only tries to match the date format at the start of the string
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
match_result = pattern.match(text)

if match_result:
    print(f"Match found: {match_result.group(0)}")
else:
    print("No match at the beginning of the string.")

# Output:
# Match found: 2023-01-01

search

import re

text = "The date today is 2023-01-01, let's remember it."

# search() looks for the date format anywhere in the string
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
search_result = pattern.search(text)

if search_result:
    print(f"Search found: {search_result.group(0)}")
else:
    print("No match found in the string.")

# Output:
# Search found: 2023-01-01

match/search

import re

text = "This sentence does not start with a date like 2023-01-01."

# match() finds nothing, because the date is not at the start of the string
match_result = re.match(r'\d{4}-\d{2}-\d{2}', text)

if match_result:
    print("Match found.")
else:
    print("No match at the beginning using match().")

# search() finds the date, because it scans the whole string
search_result = re.search(r'\d{4}-\d{2}-\d{2}', text)

if search_result:
    print("Search found.")
else:
    print("No match found anywhere using search().")

# Output:
# No match at the beginning using match().
# Search found.
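One related point worth a sketch: match() anchors only the start of the string, so trailing text is still accepted. When the goal is to validate that an entire string is a date, the standard Pattern.fullmatch() method can be used instead. The sample strings below are invented for illustration.

```python
import re

date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

# Sample input invented for this illustration.
candidate = "2023-01-01 and more"

# match() succeeds because the string starts with a date.
starts_with_date = date_pattern.match(candidate) is not None

# fullmatch() fails because extra text follows the date.
is_exactly_date = date_pattern.fullmatch(candidate) is not None

# fullmatch() succeeds when the whole string is the date.
exact = date_pattern.fullmatch("2023-01-01") is not None

print(starts_with_date, is_exactly_date, exact)  # True False True
```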

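The findall and finditer methods

pattern.findall() returns a list of all non-overlapping matches; pattern.finditer() returns an iterator that yields Match objects one at a time.

Using findall() to return all matches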

import re

text = "The3 quick5 brown5 fox3 jumps5 over4 the3 lazy4 dog."

# Find all numbers in the text
pattern = re.compile(r'\d+')
matches = pattern.findall(text)
print(matches)

# Output: ['3', '5', '5', '3', '5', '4', '3', '4']

Using findall() to extract matches with multiple groups

import re

text = "John Doe, Jane Smith, Alice Johnson"

# Extract every first name and last name
pattern = re.compile(r'(\w+) (\w+)')
matches = pattern.findall(text)
print(matches)

# Output: [('John', 'Doe'), ('Jane', 'Smith'), ('Alice', 'Johnson')]
# The result is a list of tuples; each tuple is one match and holds the contents of the parenthesized groups

Using finditer() to return Match objects one by one

import re

text = "I have 3 apples and 7 bananas in 2 baskets."

# Find all numbers
pattern = re.compile(r'\d+')

for match in pattern.finditer(text):
    print(match.group(0))

# Output:
# 3
# 7
# 2
# finditer() yields Match objects one at a time; group() returns the matched text
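Because finditer() yields full Match objects rather than plain strings, it also exposes where each match sits in the text. The following sketch reuses the sentence from the example above and reads the positions with the standard Match.start() and Match.end() methods; the variable names are invented for illustration.

```python
import re

text = "I have 3 apples and 7 bananas in 2 baskets."

# Collect each matched number together with its start and end index.
positions = [(m.group(0), m.start(), m.end()) for m in re.finditer(r'\d+', text)]

for value, start, end in positions:
    # Slicing the text with the reported indices gives back the match itself.
    print(value, start, end, text[start:end] == value)
```

findall() cannot provide these positions, which is one practical reason to prefer finditer() when the location of a match matters.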

Using finditer() with more complex match structures

import re

text = "colors: red, colors:blue; shapes: square, shapes:circle"

# Match either a color or a shape; group 1 captures colors, group 2 captures shapes
pattern = re.compile(r'(?:colors?[:\s]+(\w+)(?:[,;\s]|$))|(?:shapes?[:\s]+(\w+)(?:[,;\s]|$))')

for match in pattern.finditer(text):
    if match.group(1):  # a color was matched
        print(f"Color found: {match.group(1)}")
    elif match.group(2):  # a shape was matched
        print(f"Shape found: {match.group(2)}")

# Output:
# Color found: red
# Color found: blue
# Shape found: square
# Shape found: circle

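Advanced usage

Groups and backreferences

Parentheses create subgroups, which capture part of a match so it can be referenced later. For example, with re.compile(r'(\w+) (\d+)'), \1 and \2 refer to the contents of the first and second subgroup respectively.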
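A backreference can also appear inside the pattern itself, where \1 must repeat exactly what group 1 captured. The sketch below uses that to find accidentally doubled words; the pattern name and the sample sentence are invented for illustration.

```python
import re

# \b(\w+) captures a whole word; " \1" then requires the same word again,
# and the final \b makes sure the repetition is a whole word too.
repeated_word = re.compile(r'\b(\w+) \1\b')

# Sample text invented for this illustration.
text = "This is is a test of the the parser."

# findall() returns the content of group 1 for each doubled word.
dupes = repeated_word.findall(text)

# In the replacement string, \1 inserts the captured word once.
fixed = repeated_word.sub(r'\1', text)

print(dupes)  # ['is', 'the']
print(fixed)  # This is a test of the parser.
```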

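Replacing part of a text

import re

text = "John Doe has 3 apples and Jane Smith has 7 bananas."

# Group 1 captures the full name, group 2 the number; the trailing \w+ consumes the fruit name
pattern = re.compile(r'(\w+ \w+) has (\d+) \w+')
new_text = pattern.sub(r'\1 has \2 fruits', text)
print(new_text)

# Output: John Doe has 3 fruits and Jane Smith has 7 fruits.
# In this example, \1 is replaced by the first subgroup (the name) and \2 by the second subgroup (the number)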

Extracting and recombining subgroups

import re

text = "The date is 2023-01-01, and the time is 15:30:45."

pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
match = pattern.search(text)

if match:
    date_reformatted = f"{match.group(1)}.{match.group(2)}.{match.group(3)}"
    print(date_reformatted)

# Output: 2023.01.01
# Here group() fetches the content of each subgroup, and the pieces are then recombined
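The same recombination can be done without manual string formatting. As a sketch, the variant below uses named groups, (?P<name>...), and refers to them in the replacement with \g<name>; both are standard re syntax, while the group names year, month, and day are choices made for this example.

```python
import re

text = "The date is 2023-01-01, and the time is 15:30:45."

# Named groups make the pattern self-documenting.
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# groupdict() returns the named groups of a match as a dictionary.
parts = date_pattern.search(text).groupdict()

# \g<name> in the replacement inserts the text captured by that group.
reordered = date_pattern.sub(r'\g<day>/\g<month>/\g<year>', text)

print(parts)      # {'year': '2023', 'month': '01', 'day': '01'}
print(reordered)  # The date is 01/01/2023, and the time is 15:30:45.
```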

Using subgroups in search results

import re

text = "Some emails are user1@exam.com, user2@apple.net, and user3@example.org."

pattern = re.compile(r'([\w.%+-]+)@([\w.-]+)\.([a-z]{2,})')
matches = pattern.findall(text)

# Each match is a tuple of the three subgroups: username, domain name, top-level domain
for username, domain, dtype in matches:
    print(f"Username: {username}, Domain: {domain}.{dtype}")

# Output:
# Username: user1, Domain: exam.com
# Username: user2, Domain: apple.net
# Username: user3, Domain: example.org

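Greedy and lazy matching

Appending ? to *, +, or ? (giving *?, +?, and ??) switches the quantifier to non-greedy mode, in which it matches as few characters as possible.

Greedy and non-greedy * quantifier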

import re

text = "I love Python programming and Java programming very much!"

# Greedy mode: .* extends to the last "programming"
pattern_greedy = re.compile(r'love.*programming')
match_greedy = pattern_greedy.search(text)
print(match_greedy.group(0))  # Output: love Python programming and Java programming

# Non-greedy mode: .*? stops at the first "programming"
pattern_lazy = re.compile(r'love.*?programming')
match_lazy = pattern_lazy.search(text)
print(match_lazy.group(0))  # Output: love Python programming

Greedy and non-greedy + quantifier

import re

text = "The numbers are 139-626 and 123456."

# Greedy mode: each match takes the longest possible run of digits
pattern_greedy = re.compile(r'\d+')
matches_greedy = pattern_greedy.findall(text)
print(matches_greedy)

# Output: ['139', '626', '123456']

# Non-greedy mode: each match stops after a single digit
pattern_lazy = re.compile(r'\d+?')
matches_lazy = pattern_lazy.findall(text)
print(matches_lazy)

# Output: ['1', '3', '9', '6', '2', '6', '1', '2', '3', '4', '5', '6']

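Greedy and non-greedy ? quantifier

import re

text = "Optional text or not?"

# Greedy mode: (Optional)? matches "Optional", then .* consumes the rest of the string
pattern_greedy = re.compile(r'(Optional)?.*')
match_greedy = pattern_greedy.search(text)
print(match_greedy.group(0))  # Output: Optional text or not?

# Non-greedy mode: (Optional)? is still greedy and matches "Optional",
# but .*? matches as little as possible, which here is nothing
pattern_lazy = re.compile(r'(Optional)?.*?')
match_lazy = pattern_lazy.search(text)
print(match_lazy.group(0))  # Output: Optional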
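To isolate the effect of the lazy form ?? on its own, here is a small additional sketch. The word "colour" and the variable names are chosen for this illustration only.

```python
import re

text = "colour"

# Greedy ?: the optional "u" is taken when it is available.
greedy = re.match(r'colou?', text).group(0)

# Lazy ??: the optional "u" is skipped when the pattern can succeed without it.
lazy = re.match(r'colou??', text).group(0)

# If the rest of the pattern cannot match otherwise, ?? still takes the "u".
forced = re.match(r'colou??r', text).group(0)

print(greedy)  # colou
print(lazy)    # colo
print(forced)  # colour
```

The last line shows that a lazy quantifier changes the order in which alternatives are tried, not whether a match is possible.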

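Predefined character classes and special characters

\d, \D, \w, \W, \s, and \S stand for digit, non-digit, word character, non-word character, whitespace, and non-whitespace respectively.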
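The uppercase classes are simply the complements of the lowercase ones, which makes them handy for stripping or splitting. A minimal sketch, with sample strings invented for illustration:

```python
import re

# Sample text invented for this illustration.
text = "Order #42:\t3 items, total $19.99\n"

# \D matches every non-digit, so removing it keeps only the digits.
digits_only = re.sub(r'\D', '', text)

# \s+ matches runs of whitespace (space, tab, newline).
tokens = re.split(r'\s+', text.strip())

# \W matches non-word characters; the underscore counts as a word character.
non_word = re.findall(r'\W', "a_b-c d!")

# \S+ matches runs of non-whitespace characters.
non_space_runs = re.findall(r'\S+', " x  yz ")

print(digits_only)     # 4231999
print(tokens)          # ['Order', '#42:', '3', 'items,', 'total', '$19.99']
print(non_word)        # ['-', ' ', '!']
print(non_space_runs)  # ['x', 'yz']
```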

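Conclusion and discussion

Regular expressions and the re library can do far more than what is shown here; they are deep and flexible enough for all kinds of complex text-processing scenarios. Mastering them takes continued practice, though. This article only gets you through the door of Python's re library, and much of what regular expressions can do is still waiting for you to explore. If you find yourself asking "my pattern looks right, so why doesn't it match?", come back to this article or leave your question in the comments, and we can work through it together until regular expressions truly become your "text magic wand".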
