String matching is a very common algorithm. What it does is, for the pattern string t, find the first occurrence of it in the target string s. For example, if t = "abcab", s = "ababcabd", the algorithm should return 2. Talking about string matching, we have to mention the KMP algorithm. KMP is a very powerful algorithm, which is very clever and can match strings in linear time complexity. In this article, we’ll learn about it.

This article will use Python’s slicing syntax to represent substrings. If you’re not familiar with Python, just remember that s[:k] means that s is prefixed by k in length, and s[-k:] means that s is suffixed by k in length. In addition, the subscript of the string starts from 0, so the last character of s[:k] is s[k-1].

Matching algorithm

Roughly speaking, the KMP algorithm exploits a property of strings: it is possible for the end of a string to match its own first part. For example, the string abcab has two characters at the end that can match its own first part:

1
2
3
   abcab
   ||
abcab

In formal language: For a string t, there exists a maximum number 0 <= m < len(t) such that t[:m] == t[-m:] holds. We call m the maximum first and last match length. Note that m must be less than the length of t, otherwise it is meaningless – any string is always equal to itself.

KMP first finds the maximum first and last match lengths for all prefixes of the pattern string, which are stored in the array pi. The maximum first and last match length for the prefix t[:i] is pi[i-1]. For example, the pi array for the string p = "abcab" is [0, 0, 0, 1, 2], because:

  • p[:1] = "a", the maximum first and last match length is 0, pi[0] = 0
  • p[:2] = "ab", maximum first and last match length is 0, pi[1] = 0
  • p[:3] = "abc", maximum first and last match length is 0, pi[2] = 0
  • p[:4] = "abca", the maximum first and last match length is 1, pi[0] = 1, because there is
    1
    2
    3
    
       abca
       |
    abca
    
  • p[:5] = "abcab", the maximum first and last match length is 2, pi[0] = 2, because
    1
    2
    3
    
       abcab
       ||
    abcab
    

Let’s not worry about how KMP finds the pi array, but let’s see how KMP uses the pi array to do matching.

Suppose the string s matches the pattern string t. We start with the 0th character, then the 1st, 2nd, and up to the kth character, and the match is successful; however, the kth + 1th character fails.

kmp_1

The match failed, so what do we do? Here’s the kicker. The first 0 to k characters match successfully, and the substring equals t[:k+1] . The pi array tells us that the last pi[k] characters of this substring are exactly equal to the first pi[k] characters of it.

image

So next we let s[k+1] compare with t[pi[k]], if it’s equal, we continue to match the next characters; if it’s not equal - no matter, t[:pi[k]] has already matched, we go back to the pi array to get the maximum length of the first and last matches of this substring, and compare the corresponding characters in the same way.

With this in mind, it is easy to write the code for the KMP algorithm:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def kmp(s, t):
    pi = calc_pi(t) # 先不管如何计算 pi
    j = 0
    for i, c in enumerate(s):
        while j > 0 and t[j] != c: # t[j] 匹配失败, 但是 t[:j] 匹配成功
            j = pi[j-1] # t[:j] 的后 pi[j-1] 个字符与前 pi[j-1] 个字符相等, 下一步匹配 t[pi[j-1]]
        if t[j] == c:
            j += 1
        if j == len(t):
            return i - j + 1 # 返回起始位置

    return -1

Find the pi array

Since the pi array is so useful, how do we find it? First it’s easy to think of pi[0] = 0 , because the maximum first and last match length needs to be less than the string length. If we can use pi[0], pi[1], ... , pi[k] pushes out pi[k+1] , we can find the entire pi array.

Suppose we find pi[k] , which is the maximum first and last match length of t[:k+1].

image

What is the maximum first match length of pi[k+1], i.e., t[:k+2]? This is discussed in two cases. Suppose t[k+1] == t[pi[k]] , which is a simple case, pi[k+1] = pi[k] + 1 .

image

What if they’re not equal? That’s fine, the first few characters, t[:pi[k]], match, don’t they? Based on the pi array we just found, we know that for the substring t[:pi[k]], the last pi[pi[k]-1] characters are equal to the first pi[pi[k]-1] characters! This brings us back to the previous case of the matching algorithm.

image

Next we just need to compare t[k+1] with t[pi[pi[k]-1]]. If they are equal, then pi[k+1] = pi[pi[k]-1] + 1; if they are not equal, then we repeat the operation: query the pi array to get the maximum first and last matches of the previous equal parts, and then compare the corresponding characters.

In this way, the code for computing the pi array is easy to understand.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
def calc_pi(t):
    pi = [0] * len(t) # pi[0] = 0
    j = 0 # j 等于 pi[-1]
    for i in range(1, len(t)):
        while j > 0 and t[i] != t[j]: # t[j] 匹配失败, 但是 t[:j] 匹配成功. 注意这里 j = pi[i-1]
            j = pi[j-1] # t[:j] 的后 pi[j-1] 个字符与前 pi[j-1] 个字符相等, 下一步匹配 t[pi[j-1]]
        if t[i] == t[j]:
            j += 1
        pi[i] = j

    return pi

The code to compute pi is very similar to the KMP matching algorithm. The matching algorithm matches the pattern string with the target string, while computing pi matches the pattern string with itself. The matching process only relies on the previously computed pi value, so it is a dynamic programming algorithm.

Summary

The following code puts the two parts of the algorithm together. To recap, KMP uses the matching algorithm to gradually derive the pi array, and then uses the same matching algorithm to match the pattern string with the target string using the previously derived pi. So the KMP algorithm is a very powerful algorithm.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def kmp(s, t):
    if not t: return 0
    pi = [0] * len(t)
    j = 0
    for i in range(1, len(t)):
        while j > 0 and t[i] != t[j]:
            j = pi[j-1]
        if t[i] == t[j]:
            j += 1
        pi[i] = j

    j = 0
    for i, c in enumerate(s):
        while j > 0 and t[j] != c:
            j = pi[j-1]
        if t[j] == c:
            j += 1
        if j == len(t):
            return i - j + 1

    return -1