Algorithms Substring Search suggest change

Introduction To Knuth-Morris-Pratt KMP Algorithm

Suppose that we have a text and a pattern. We need to determine if the pattern exists in the text or not. For example:

+-------+---+---+---+---+---+---+---+---+
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------+---+---+---+---+---+---+---+---+
|  Text | a | b | c | b | c | g | l | x |
+-------+---+---+---+---+---+---+---+---+

+---------+---+---+---+---+
| Index   | 0 | 1 | 2 | 3 |
+---------+---+---+---+---+
| Pattern | b | c | g | l |
+---------+---+---+---+---+

This pattern does exist in the text. So our substring search should return 3, the index of the position from which this pattern starts. So how does our brute force substring search procedure work?

What we usually do is: we start from the 0th index of the text and the 0th index of our *pattern and we compare Text[0] with Pattern[0]. Since they are not a match, we go to the next index of our text and we compare Text[1] with Pattern[0]. Since this is a match, we increment the index of our pattern and the index of the Text also. We compare Text[2] with Pattern[1]. They are also a match. Following the same procedure stated before, we now compare Text[3] with Pattern[2]. As they do not match, we start from the next position where we started finding the match. That is index 2 of the Text. We compare Text[2] with Pattern[0]. They don’t match. Then incrementing index of the Text, we compare Text[3] with Pattern[0]. They match. Again Text[4] and Pattern[1] match, Text[5] and Pattern[2] match and Text[6] and Pattern[3] match. Since we’ve reached the end of our Pattern, we now return the index from which our match started, that is 3. If our pattern was: bcgll, that means if the pattern didn’t exist in our text, our search should return exception or -1 or any other predefined value. We can clearly see that, in the worst case, this algorithm would take O(mn) time where m is the length of the Text and n is the length of the Pattern. How do we reduce this time complexity? This is where KMP Substring Search Algorithm comes into the picture.

The Knuth-Morris-Pratt String Searching Algorithm or KMP Algorithm searches for occurrences of a “Pattern” within a main “Text” by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. The algorithm was conceived in 1970 by Donuld Knuth and Vaughan Pratt and independently by James H. Morris. The trio published it jointly in 1977.

Let’s extend our example Text and Pattern for better understanding:

+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| Index |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10|11|12|13|14|15|16|17|18|19|20|21|22|
+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Text |a |b |c |x |a |b |c |d |a |b |x |a |b |c |d |a |b |c |d |a |b |c |y |
+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | y |
+---------+---+---+---+---+---+---+---+---+

At first, our Text and Pattern matches till index 2. Text[3] and Pattern[3] doesn’t match. So our aim is to not go backwards in this Text, that is, in case of a mismatch, we don’t want our matching to begin again from the position that we started matching with. To achieve that, we’ll look for a suffix in our Pattern right before our mismatch occurred (substring abc), which is also a prefix of the substring of our Pattern. For our example, since all the characters are unique, there is no suffix, that is the prefix of our matched substring. So what that means is, our next comparison will start from index 0. Hold on for a bit, you’ll understand why we did this. Next, we compare Text[3] with Pattern[0] and it doesn’t match. After that, for Text from index 4 to index 9 and for Pattern from index 0 to index 5, we find a match. We find a mismatch in Text[10] and Pattern[6]. So we take the substring from Pattern right before the point where mismatch occurs (substring abcdabc), we check for a suffix, that is also a prefix of this substring. We can see here ab is both the suffix and prefix of this substring. What that means is, since we’ve matched until Text[10], the characters right before the mismatch is ab. What we can infer from it is that since ab is also a prefix of the substring we took, we don’t have to check ab again and the next check can start from Text[10] and Pattern[2]. We didn’t have to look back to the whole Text, we can start directly from where our mismatch occurred. Now we check Text[10] and Pattern[2], since it’s a mismatch, and the substring before mismatch (abc) doesn’t contain a suffix which is also a prefix, we check Text[10] and Pattern[0], they don’t match. After that for Text from index 11 to index 17 and for Pattern from index 0 to index 6. We find a mismatch in Text[18] and Pattern[7]. So again we check the substring before mismatch (substring abcdabc) and find abc is both the suffix and the prefix. So since we matched till Pattern[7], abc must be before Text[18]. That means, we don’t need to compare until Text[17] and our comparison will start from Text[18] and Pattern[3]. Thus we will find a match and we’ll return 15 which is our starting index of the match. This is how our KMP Substring Search works using suffix and prefix information.

Now, how do we efficiently compute if suffix is same as prefix and at what point to start the check if there is a mismatch of character between Text and Pattern. Let’s take a look at an example:

+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+

We’ll generate an array containing the required information. Let’s call the array S. The size of the array will be same as the length of the pattern. Since the first letter of the Pattern can’t be the suffix of any prefix, we’ll put S[0] = 0. We take i = 1 and j = 0 at first. At each step we compare Pattern[i] and Pattern[j] and increment i. If there is a match we put S[i] = j + 1 and increment j, if there is a mismatch, we check the previous value position of j (if available) and set j = S[j-1] (if j is not equal to 0), we keep doing this until S[j] doesn’t match with S[i] or j doesn’t become 0. For the later one, we put S[i] = 0. For our example:

j   i
+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+

Pattern[j] and Pattern[i] don’t match, so we increment i and since j is 0, we don’t check the previous value and put Pattern[i] = 0. If we keep incrementing i, for i = 4, we’ll get a match, so we put S[i] = S[4] = j + 1 = 0 + 1 = 1 and increment j and i. Our array will look like:

j               i
+---------+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---------+---+---+---+---+---+---+---+---+
| Pattern | a | b | c | d | a | b | c | a |
+---------+---+---+---+---+---+---+---+---+
|    S    | 0 | 0 | 0 | 0 | 1 |   |   |   |
+---------+---+---+---+---+---+---+---+---+

Since Pattern[1] and Pattern[5] is a match, we put S[i] = S[5] = j + 1 = 1 + 1 = 2. If we continue, we’ll find a mismatch for j = 3 and i = 7. Since j is not equal to 0, we put j = S[j-1]. And we’ll compare the characters at i and j are same or not, since they are same, we’ll put S[i] = j + 1. Our completed array will look like:

+---------+---+---+---+---+---+---+---+---+
|    S    | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 |
+---------+---+---+---+---+---+---+---+---+

This is our required array. Here a nonzero-value of S[i] means there is a S[i] length suffix same as the prefix in that substring (substring from 0 to i) and the next comparison will start from S[i] + 1 position of the Pattern. Our algorithm to generate the array would look like:

Procedure GenerateSuffixArray(Pattern):
i := 1
j := 0
n := Pattern.length
while i is less than n
    if Pattern[i] is equal to Pattern[j]
        S[i] := j + 1
        j := j + 1
        i := i + 1
    else
        if j is not equal to 0
            j := S[j-1]
        else
            S[i] := 0
            i := i + 1
        end if
    end if
end while

The time complexity to build this array is O(n) and the space complexity is also O(n). To make sure if you have completely understood the algorithm, try to generate an array for pattern aabaabaa and check if the result matches with this one.

Now let’s do a substring search using the following example:

+---------+---+---+---+---+---+---+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |
+---------+---+---+---+---+---+---+---+---+---+---+---+---+
|   Text  | a | b | x | a | b | c | a | b | c | a | b | y |
+---------+---+---+---+---+---+---+---+---+---+---+---+---+

+---------+---+---+---+---+---+---+
|  Index  | 0 | 1 | 2 | 3 | 4 | 5 |
+---------+---+---+---+---+---+---+
| Pattern | a | b | c | a | b | y |
+---------+---+---+---+---+---+---+
|    S    | 0 | 0 | 0 | 1 | 2 | 0 |
+---------+---+---+---+---+---+---+

We have a Text, a Pattern and a pre-calculated array S using our logic defined before. We compare Text[0] and Pattern[0] and they are same. Text[1] and Pattern[1] are same. Text[2] and Pattern[2] are not same. We check the value at the position right before the mismatch. Since S[1] is 0, there is no suffix that is same as the prefix in our substring and our comparison starts at position S[1], which is 0. So Pattern[0] is not same as Text[2], so we move on. Text[3] is same as Pattern[0] and there is a match till Text[8] and Pattern[5]. We go one step back in the S array and find 2. So this means there is a prefix of length 2 which is also the suffix of this substring (abcab) which is ab. That also means that there is an ab before Text[8]. So we can safely ignore Pattern[0] and Pattern[1] and start our next comparison from Pattern[2] and Text[8]. If we continue, we’ll find the Pattern in the Text. Our procedure will look like:

Procedure KMP(Text, Pattern)
GenerateSuffixArray(Pattern)
m := Text.Length
n := Pattern.Length
i := 0
j := 0
while i is less than m
    if Pattern[j] is equal to Text[i]
        j := j + 1
        i := i + 1
    if j is equal to n
        Return (j-i)
    else if i < m and Pattern[j] is not equal t Text[i]
        if j is not equal to 0
            j = S[j-1]
        else
            i := i + 1
        end if
    end if
end while
Return -1

The time complexity of this algorithm apart from the Suffix Array Calculation is O(m). Since GenerateSuffixArray takes O(n), the total time complexity of KMP Algorithm is: O(m+n).

PS: If you want to find multiple occurrences of Pattern in the Text, instead of returning the value, print it/store it and set j := S[j-1]. Also keep a flag to track whether you have found any occurrence or not and handle it accordingly.

Feedback about page:

Feedback:
Optional: your email if you want me to get back to you:



Table Of Contents