字符串模式匹配：Z 算法

概述

Z 算法是一种用于在字符串中搜索给定模式的算法。它是一种有效的算法，因为它具有线性时间复杂度。它的时间复杂度为 O（m+n），其中 m 是字符串的长度，n 是要搜索的模式的长度。

Z 算法简介

有没有想过如何匹配给定字符串中的模式，这太有效了。例如，如果您需要将 DNA 序列与 DNA 模式进行匹配。DNA序列的平均长度约为1500亿个。在这种情况下使用暴力字符串匹配算法将导致非常高的处理时间，因为它具有二次时间复杂度，因此使用起来不切实际。

好吧，这个问题的解决方案可能是 Z 算法，因为它是一种高效的字符串匹配算法。继续阅读以了解更多信息。

Z算法是一种模式搜索算法。这意味着它用于在字符串中搜索给定模式的出现。它是一种有效的算法，因为它具有线性时间复杂度。

什么是Z算法？

为了应用 Z 算法，我们需要一个字符串和一个要搜索的模式。

如前几节所述，Z 算法是一种用于查找字符串中出现的模式的算法。现在让我们看看它是如何工作的。

Z 算法通过维护一个称为 Z 数组的辅助数组来工作。这个 Z 数组存储最长子字符串的长度，从当前索引开始，也是它的前缀。这意味着每个索引都存储字符数，与起始字符匹配，从此索引开始。这意味着，如果 Z 数组对任何索引都有值 k，则表示该索引之后的 k 个字符与字符串的前 k 个字符匹配。这是Z算法的基本部分。

为了使用 Z 算法，我们首先将搜索模式和要搜索的字符串连接在一起，在两者之间添加另一个字符，该字符在任何字符串中都不存在。然后我们为这个新创建的字符串生成 Z 数组。

然后我们扫描 Z 数组，在每个索引处，其值等于要搜索的模式的长度，我们说该模式已被找到。

newString=pattern+'$'+字符串

其中 $ 可以是字符串或模式中不存在的任何字符。

Z算法分析

现在是时候分析上述技术为什么以及如何工作了。让我们首先了解一下 Z 数组的使用。Z 数组在每个索引处存储从它开始的最长子字符串的长度，与字符串的前缀（起始字符）匹配。现在，当我们用一个附加字符将模式连接到我们的原始字符串时，基本上我们将新创建的字符串的前缀（即我们的模式）与字符串的其余部分（即要搜索的字符串）匹配。

由于我们添加了一个在任何字符串中都不存在的附加字符，因此我们确信永远不会找到比我们的模式更长的字符串。

让我们借助以下示例来理解它。

正如我们在示例中看到的，我们已经将模式和字符串连接在一起，中间有一个附加字符，这两个字符都不存在。现在，当我们生成 Z 数组时，我们基本上是在搜索字符串的前 4 个字符，这也是我们的模式。由于我们添加了一个额外的字符，因此我们可以找到的最大模式将是我们需要搜索的模式。现在，让我们分析一下 Z 阵列。

Z 数组的第一个值显式设置为零。正如我们在索引 1 处的数组中看到的那样，当 'a' 与索引 0（'a'）处的字符匹配时，该值设置为 1。然后，由于所有值都不匹配，因此它们设置为零。现在让我们看看索引 10 处的值。它设置为 4，因为从此索引开始的 4 个字符与字符串的前缀匹配（从 4 个字符开始）。

现在，为了找到模式的出现，我们扫描 Z 数组，我们可以说只要 Z 值等于模式的长度，就可以找到模式。这就是 Z 算法的工作原理。

Z算法的工作原理

现在让我们了解 Z 算法的工作原理。

首先让我们了解为什么 Z 算法使用窗口。Z 算法通过使用 Z 数组中的先前值来加快其执行速度，并且这些值根据当前窗口使用。我们检查当前窗口内是否有可能有一个大于当前长度的字符串，如果不可能，我们跳过窗口内剩余值的计算。

这在很大程度上加快了算法的速度，因为跳过了许多元素。如果不使用这种技术，该算法将对所有元素执行计算，从而简化为二次算法，等同于朴素模式搜索。

现在让我们了解 Z 算法如何使用窗口来加速其执行。Z 算法在两个变量（左变量和右变量）之间维护一个窗口。创建 Z 数组时，工作仅在此窗口内完成，并且窗口会不断更新。现在，当我们在窗口内时，我们找到窗口内的当前元素位置，让它为 k。

然后我们检查窗口中 k 之后的元素数量，如果它大于 Z 数组中第 k 个索引处的值，则当前索引处的 Z 值等于第 k 个索引处的 Z 值。

这就是 Z 算法中的窗口用于加速执行的方式。该窗口基本上使用 Z 数组的已填充值。

算法

Start

z_algo(search_string,pattern)

concatStr = concatenate pattern + “$” + text

patLen = length of pattern

n = length of concatStr

left = 0 and right = 0

for i = 1 to n, do

if i > right, then

left = i and right = i

while right < n AND concatStr[right-left]=concatStr[right], do

increase right by 1

done

ZArray[i] = right – left

decrease right by 1

else

k = i – left

if ZArray[k] < right – i +1, then

ZArray[i] = ZArray[k]

else

left = i

while right < n AND concatStr[right-left]=concatStr[right], do

increase right by 1

done

ZArray[i] = right – left

decrease right by 1

for i = 0 to n – 1, do

if ZArray[i] = patLen, then

print the location i – patLen – 1

End

Z 算法利用字符串的先前扫描部分来加快其执行速度。我们维护一个匹配前缀的窗口 [left，right]，如果发现不匹配，则更新该窗口。

维护窗口的步骤是 -

如果 i >右边，那么没有前缀子字符串在 i 之前开始并在 i 之后结束，因此我们重置左右并通过比较 str[0..] 和 str[i..] 来计算新的 [left，right] 并得到 ZArray[i] = （right-left+1）。如果 i <= right，那么让 K = i-left，现在 ZArray[i] >= min（Z[K]， right-i+1），因为 str[i..] 与 str[K..] 匹配，至少为 right-i+1 个字符（它们在 [left，right] 间隔中，我们知道这是一个前缀子字符串）。现在出现了两个子案例—— a）如果 ZArray[K] < right-i+1，则没有从 str[i] 开始的前缀子字符串（否则 ZArray[K] 会更大），因此 ZArray[i] = ZArray[K] 和间隔 [left，right] 保持不变。 b）如果 ZArray[K] >= right-i+1，则可以扩展 [left，right] 间隔，因此我们将 left 设置为 i 并从 str[right] 开始匹配并获得新的 right，然后我们将更新间隔 [left，right] 并计算 ZArray[i] = （right-left+1）。

试运行

让我们以字符串 caabxaaab 为例，搜索模式 aab。我们要做的第一件事是通过连接旧字符串来生成一个新字符串。因此，新字符串变为 aab$caabxaaab。现在我们将为这个字符串生成 Z 数组：

Z 数组中索引零处的值没有用，因此可以将其留空，因此我们从索引 1 开始。Z_Array[1] 将等于 1，因为只有一个字符 a 与起始字符 a 匹配，而 b 不匹配。Z_Array[2] 将为零，因为 b 与起始字符不匹配。Z_Array[3] 和 Z_Array[4] 也将等于零，因为两者都不匹配。在 Z_Array[5] 中，我们找到了匹配项，因此我们通过搜索来显式计算它，直到找不到不匹配项。Z_Array[5]结果是3。因此，窗口变为 [5,7]。Z_Array[6]没有显式计算，因为它在窗口内，大于Z_Array[1]后的元素数量。因此Z_Array[6]等于Z_Array[1]，即1。在这里，我们使用了先前计算的值。Z_Array[7]也不会被计算，Z_Array[2]的先前值被用作Z_Array[7]的值。现在我们从上一个窗口中出来，创建一个大小为 1 的新窗口。Z_Array[8] 被显式计算，结果为 1。在Z_Array[9]中，我们找到了一个匹配项，因此我们通过搜索来显式计算它，直到找不到不匹配项。Z_Array[9]结果是2。因此，窗口变为 [9,11]。Z_Array[10]位于窗口内，但Z值在窗口中的位置，即Z[1]不小于窗口中剩余的元素数量。所以Z_Array[10]是显式计算的。结果是 3。窗口现在变为 [10,12]。Z_Array[11]没有显式计算，因为它在窗口内，大于Z_Array[1]后的元素数量。因此Z_Array[11]等于Z_Array[1]，即1。在这里，我们使用了先前计算的值。Z_Array[12]也不会被计算，Z_Array[2]的先前值被用作Z_Array[12]的值。

最终 Z 数组 - _ 1 0 0 0 3 1 0 0 2 3 1 0

这就是 Z 数组的创建方式。然后，我们搜索值等于搜索模式长度的索引，并在该索引处找到我们的元素。

Z算法的时间复杂度

Z 算法的时间复杂度为 O（m+n），其中 n 是要搜索的字符串的长度，m 是要搜索的模式的长度。

可以计算如下：我们新创建的字符串的长度是 m+n。遍历字符串需要线性时间，即 = O（m+n）

每当遇到 while 循环时，假设执行了 k 次操作，但会跳过外部循环的接下来 k 次迭代，因为上限每次都会增加 k。

因此，创建 Z 阵列所需的总时间仍然是 O（m+n）。

在 Z 数组中搜索 m 所花费的时间也需要 O（m+n）时间。

所以总时间复杂度为：-

创建 Z 数组所花费的总时间仍然是 O（m+n） + 在 Z 数组中搜索 m 所花费的时间也需要 O（m+n）时间

=O（米+n）+O（米+n）

=O（米+n）

Z算法的实现

C/C++ 中的 Z 算法

#include

using namespace std;

void create_Zarr(string str, int Z[])

{

int n = str.length();

int left, right, k;

// [left,right] make a window which matches with prefix of str

left = right = 0;

for (int i = 1; i < n; ++i)

{

// if i>right nothing matches so we will calculate.

// Z[i] using naive way.

if (i > right)

{

left = right = i;

// right-left = 0 in starting, so it will start

// checking from 0'th index.

while (right

right++;

Z[i] = right-left;

right--;

}

else

{

// k = i-left so k corresponds to number which

// matches in [left,right] interval.

k = i-left;

// if Z[k] is less than remaining interval

// then Z[i] will be equal to Z[k].

if (Z[k] < right-i+1)

Z[i] = Z[k];

else

{

// else start from right and check manually

left = i;

while (right

right++;

Z[i] = right-left;

right--;

}

void find(string text, string pattern)

{

// Create concatenated string "P$T with additional character"

string concat = pattern + "$" + text;

int l = concat.length();

// Constructing Z array

int Z[l];

create_Zarr(concat, Z);

// now looping through Z array for matching condition

for (int i = 0; i < l; ++i)

{

// if Z[i] (matched region) is equal to pattern

// length we got the pattern

if (Z[i] == pattern.length())

cout << "Pattern found at index "

<< i - pattern.length() -1 << endl;

}

// Driver program

int main()

{

string text = "faabbcdeffghiaaabbcdfgaabf";

string pattern = "aabb";

find(text, pattern);

return 0;

}

输出

Pattern found at index 1

Pattern found at index 14

C++代码解释

main 函数由字符串和模式组成。然后使用字符串和模式作为参数调用查找结果。现在让我们看看 find 函数中会发生什么。

find 函数做的第一件事是通过将模式与字符串连接起来来创建字符串 concat，中间有一个附加字符 $。现在我们将这个字符串 concat 传递给 create_Zarr 函数，为这个字符串创建 Z 数组。现在让我们看看 Z 数组是如何创建的。

在 createZarr 函数中，我们要做的第一件事是创建两个名为 left 和 right 的变量，以维护用于创建数组的搜索窗口。

左和右都初始化为零。现在我们开始遍历字符串并填充 Z 数组。Z 数组的第一个值从不使用，可以不赋值。现在我们检查我们当前的指针是否 i.

如果它大于 right，则意味着我们当前不在窗口之外，因此我们使用朴素字符串匹配来计算此索引处的 Z 值，即逐个匹配每个字符，直到发现不匹配。该值将成为当前索引的 Z 值。

现在，如果我们的当前索引当前在窗口内，即小于右，我们计算 k，它指定了当前索引在窗口内的位置。如果第 k 个索引处的 Z 值小于 right-i+1，即 k 之后窗口内的元素数，则当前索引处的 Z 值等于第 k 个索引处的 Z 值，窗口将保持不变。

如果第 k 个索引处的 Z 值大于或等于 right-i+1，则可以延长当前窗口。因此，我们将设置 left=i 并像以前一样再次开始匹配字符串。匹配字符串后，当前索引 Z[i] 的 Z 值将变为等于 right-left。

这就是为 concat 字符串创建 Z 数组的方式。现在要找到字符串中模式的出现，我们在 Z 数组中搜索等于模式长度的值，如果在索引 i 处找到它，我们可以输出该模式在索引 i-pattern_length-1 处找到（删除添加到原始字符串中以创建 Z 数组的额外长度）。

这就是 Z 算法在字符串中查找所有出现的模式的工作方式。

2. JAVA中的Z算法

class z_algo {

// prints all occurrences of pattern in text using

// Z algo

public static void find(String text, String pattern)

{

// Create concatenated string "P$T"

String concat = pattern + "$" + text;

int l = concat.length();

int Z[] = new int[l];

// Construct Z array

create_Zarr(concat, Z);

// now looping through Z array for matching condition

for(int i = 0; i < l; ++i){

// if Z[i] (matched region) is equal to pattern

// length we got the pattern

if(Z[i] == pattern.length()){

System.out.println("Pattern found at index "

+ (i - pattern.length() - 1));

}

// Fills Z array for given string str[]

private static void create_Zarr(String str, int[] Z) {

int n = str.length();

// [left,right] make a window which matches with

// prefix of s

int left = 0, right = 0;

for(int i = 1; i < n; ++i) {

// if i>right nothing matches so we will calculate.

// Z[i] using naive way.

if(i > right){

left = right = i;

while(right < n && str.charAt(right - left) == str.charAt(right))

right++;

Z[i] = right - left;

right--;

}

else{

// k = i-left so k corresponds to number which

// matches in [left,right] interval.

int k = i - left;

if(Z[k] < right - i + 1)

Z[i] = Z[k];

else{

// else start from right and check manually

left = i;

while(right < n && str.charAt(right - left) == str.charAt(right))

right++;

Z[i] = right - left;

right--;

}

// Driver program

public static void main(String[] args)

{

String text = "faabbcdeffghiaaabbcdfgaabf";

String pattern = "aabb";

find(text, pattern);

}

输出

Pattern found at index 1

Pattern found at index 14

3. Python 中的 Z 算法

# Fills Z array for given string str[]

def create_Zarr(string, z):

n = len(string)

# [L,R] make a window which matches

# with prefix of s

left, right, k = 0, 0, 0

for i in range(1, n):

# if i>R nothing matches so we will calculate.

# Z[i] using naive way.

if i > right:

left, right = i, i

# R-L = 0 in starting, so it will start

# checking from 0'th index.

while right < n and string[right - left] == string[right]:

right += 1

z[i] = right - left

right -= 1

else:

# k = i-L so k corresponds to number which

# matches in [L,R] interval.

k = i - left

# if Z[k] is less than remaining interval

# then Z[i] will be equal to Z[k].

if z[k] < right - i + 1:

z[i] = z[k]

else:

# else start from R and check manually

left = i

while right < n and string[right - left] == string[right]:

right += 1

z[i] = right - left

right -= 1

# prints all occurrences of pattern

# in text using Z algo

def find(text, pattern):

# Create concatenated string "P$T"

concat = pattern + "$" + text

left = len(concat)

# Construct Z array

z = [0] * left

create_Zarr(concat, z)

# now looping through Z array for matching condition

for i in range(left):

# if Z[i] (matched region) is equal to pattern

# length we got the pattern

if z[i] == len(pattern):

print("Pattern found at index",

i - len(pattern) - 1)

# Driver Code

if __name__ == "__main__":

text = "faabbcdeffghiaaabbcdfgaabf"

pattern = "aabb"

find(text, pattern)

输出

Pattern found at index 1

Pattern found at index 14

Z 算法示例

Z 算法可用于查找字符串中的任何模式。

让我们看一些例子。

1. DNA序列

Z 算法可用于查找 DNA 序列中的 DNA 模式。此应用非常重要，因为 DNA 序列是非常大的字符串，因此需要一种有效的算法。

在这个例子中，我们在长度为100的DNA序列中搜索atgc模式。

输入

String=cgactgttatgggttcagtctcgttagtaaataatacaaaatgcccgttcacagctaaggttcatccgtgccgcggtaagtcccgttttcggcagcttca

Pattern=atgc

法典

#include

using namespace std;

void create_Zarr(string str, int Z[])

{

int n = str.length();

int left, right, k;

// [left,right] make a window which matches with prefix of str

left = right = 0;

for (int i = 1; i < n; ++i)

{

// if i>right nothing matches so we will calculate.

// Z[i] using naive way.

if (i > right)

{

left = right = i;

// right-left = 0 in starting, so it will start

// checking from 0'th index. For example,

// for "ababab" and i = 1, the value of right

// remains 0 and Z[i] becomes 0. For string

// "aaaaaa" and i = 1, Z[i] and right become 5

while (right

right++;

Z[i] = right-left;

right--;

}

else

{

// k = i-left so k corresponds to number which

// matches in [left,right] interval.

k = i-left;

// if Z[k] is less than remaining interval

// then Z[i] will be equal to Z[k].

// For example, str = "ababab", i = 3, right = 5

// and left = 2

if (Z[k] < right-i+1)

Z[i] = Z[k];

// For example str = "aaaaaa" and i = 2, right is 5,

// left is 0

else

{

// else start from right and check manually

left = i;

while (right

right++;

Z[i] = right-left;

right--;

}

void find(string text, string pattern)

{

// Create concatenated string "P$T with additional character"

string concat = pattern + "$" + text;

int l = concat.length();

// Constructing Z array

int Z[l];

create_Zarr(concat, Z);

// now looping through Z array for matching condition

for (int i = 0; i < l; ++i)

{

// if Z[i] (matched region) is equal to pattern

// length we got the pattern

if (Z[i] == pattern.length())

cout << "Pattern found at index "

<< i - pattern.length() -1 << endl;

}

// Fills Z array for given string str[]

// Driver program

int main()

{

string text = "cgactgttatgggttcagtctcgttagtaaataatacaaaatgcccgttcacagctaaggttcatccgtgccgcggtaagtcccgttttcggcagcttca";

string pattern = "atgc";

find(text, pattern);

return 0;

}

输出

Pattern found at index 40

2. 单词搜索

Z 算法还可用于查找句子中单词的出现次数。在这个例子中，我们发现了句子中单词 the 的出现。

输入

String=这句话中出现的可以使用 Z 算法找到模式=Pattern =

法典

#include

using namespace std;

void create_Zarr(string str, int Z[])

{

int n = str.length();

int left, right, k;

// [left,right] make a window which matches with prefix of str

left = right = 0;

for (int i = 1; i < n; ++i)

{

// if i>right nothing matches so we will calculate.

// Z[i] using naive way.

if (i > right)

{

left = right = i;

// right-left = 0 in starting, so it will start

// checking from 0'th index. For example,

// for "ababab" and i = 1, the value of right

// remains 0 and Z[i] becomes 0. For string

// "aaaaaa" and i = 1, Z[i] and right become 5

while (right

right++;

Z[i] = right-left;

right--;

}

else

{

// k = i-left so k corresponds to number which

// matches in [left,right] interval.

k = i-left;

// if Z[k] is less than remaining interval

// then Z[i] will be equal to Z[k].

// For example, str = "ababab", i = 3, right = 5

// and left = 2

if (Z[k] < right-i+1)

Z[i] = Z[k];

// For example str = "aaaaaa" and i = 2, right is 5,

// left is 0

else

{

// else start from right and check manually

left = i;

while (right

right++;

Z[i] = right-left;

right--;

}

void find(string text, string pattern)

{

// Create concatenated string "P$T with additional character"

string concat = pattern + "$" + text;

int l = concat.length();

// Constructing Z array

int Z[l];

create_Zarr(concat, Z);

// now looping through Z array for matching condition

for (int i = 0; i < l; ++i)

{

// if Z[i] (matched region) is equal to pattern

// length we got the pattern

if (Z[i] == pattern.length())

cout << "Pattern found at index "

<< i - pattern.length() -1 << endl;

}

// Fills Z array for given string str[]

// Driver program

int main()

{

string text = "the occurence of the in this sentence can be found using the Z algo";

string pattern = "the";

find(text, pattern);

return 0;

}

输出

Pattern found at index 0

Pattern found at index 17

Pattern found at index 57

Z算法的应用

Z 算法在寻找 DNA 序列中的 DNA 模式方面具有重要应用。在这种情况下使用它是因为 Z 算法在线性时间内工作，并且由于 DNA 序列非常大，因此 Z 算法可以有效地工作。Z 算法还可用于查找句子中出现的所有单词。Z 算法可以在任何地方用于查找字符串中的模式。

Z Algortihm 与其他

与Z算法相媲美的字符串匹配算法有Rabin-Karp算法、KMP算法、朴素字符串匹配。

朴素字符串搜索算法匹配模式，在每个索引上检查它，而 Rabin Karp 遵循类似的方法，但它使用哈希函数来匹配模式。

KMP 算法遵循与 Z 算法类似的方法，但它使用辅助数组来存储模式的正确前缀的最长长度，该前缀也是模式的后缀。

字符串搜索算法的时间复杂度

不。算法最佳案例平均案例最坏情况1朴素字符串搜索O（n）O（（n-m+1）*m）O（n*m）2拉宾·卡普O（n+m）O（n+m）O（n*m）3KMP公司O（n+m）O（n+m）O（n+m）4Z 算法O（n+m）O（n+m）O（n+m）

这里 n 是要在其中搜索模式的字符串的长度，m 是要搜索的模式的长度。

在上表中，我们可以看到朴素字符串匹配和 Rabin Karp 算法具有非常高的时间复杂度，因此应避免将它们用于大输入。

KMP算法和Z算法具有相似的时间复杂度，可以互换使用，但应优先使用Z算法，因为它更容易编码和理解，甚至调试Z阵列也比在KMP中调试辅助阵列更容易。

结论

在本文中，首先，我们了解了 Z 算法的基本工作原理。之后，我们举了一个例子来更好地理解Z算法的工作原理。然后我们还研究了如何借助伪代码/算法为 Z 算法编写代码。接下来是对Z算法进行时间复杂度分析。之后，我们还在 C++ 中实现了 Z 算法并进行了解释，并在 JAVA 和 Python 中实现了 Z 算法。最后讨论了Z算法的应用和实例，并将Z算法与类似的算法进行了比较。

夸智网

字符串模式匹配：Z 算法

自己动手编写CPU，操作系统，算法与数据结构，数据库，计算机网络，TCP/IP协议栈，软件工程，设计模式，编译器.

Chirp Scaling算法距离多普勒算法【SAR成像】基于RD、CS和ωk算法的合成孔径雷达成像算法原理与实现

发表评论取消回复

夸智网

字符串模式匹配：Z 算法

自己动手编写CPU，操作系统，算法与数据结构，数据库，计算机网络，TCP/IP协议栈，软件工程，设计模式，编译器.

Chirp Scaling算法 距离多普勒算法 【SAR成像】基于RD、CS和ωk算法的合成孔径雷达成像算法原理与实现

相关文章

发表评论取消回复

Chirp Scaling算法距离多普勒算法【SAR成像】基于RD、CS和ωk算法的合成孔径雷达成像算法原理与实现