# Introduction To Knuth-Morris-Pratt KMP Algorithm

Suppose that we have a *text* and a *pattern*. We need to determine if the pattern exists in the text or not. For example:

+-------+---+---+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +-------+---+---+---+---+---+---+---+---+ | Text | a | b | c | b | c | g | l | x | +-------+---+---+---+---+---+---+---+---+ +---------+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | +---------+---+---+---+---+ | Pattern | b | c | g | l | +---------+---+---+---+---+

This *pattern* does exist in the *text*. So our substring search should return **3**, the index of the position from which this *pattern* starts. So how does our brute force substring search procedure work?

What we usually do is: we start from the **0th** index of the *text* and the **0th** index of our *pattern and we compare **Text[0]** with **Pattern[0]**. Since they are not a match, we go to the next index of our *text* and we compare **Text[1]** with **Pattern[0]**. Since this is a match, we increment the index of our *pattern* and the index of the *Text* also. We compare **Text[2]** with **Pattern[1]**. They are also a match. Following the same procedure stated before, we now compare **Text[3]** with **Pattern[2]**. As they do not match, we start from the next position where we started finding the match. That is index **2** of the *Text*. We compare **Text[2]** with **Pattern[0]**. They don’t match. Then incrementing index of the *Text*, we compare **Text[3]** with **Pattern[0]**. They match. Again **Text[4]** and **Pattern[1]** match, **Text[5]** and **Pattern[2]** match and **Text[6]** and **Pattern[3]** match. Since we’ve reached the end of our *Pattern*, we now return the index from which our match started, that is **3**. If our *pattern* was: `bcgll`

, that means if the *pattern* didn’t exist in our *text*, our search should return exception or **-1** or any other predefined value. We can clearly see that, in the worst case, this algorithm would take `O(mn)`

time where **m** is the length of the *Text* and **n** is the length of the *Pattern*. How do we reduce this time complexity? This is where KMP Substring Search Algorithm comes into the picture.

The Knuth-Morris-Pratt String Searching Algorithm or KMP Algorithm searches for occurrences of a “Pattern” within a main “Text” by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. The algorithm was conceived in 1970 by Donuld Knuth and Vaughan Pratt and independently by James H. Morris. The trio published it jointly in 1977.

Let’s extend our example *Text* and *Pattern* for better understanding:

+-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | Index |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10|11|12|13|14|15|16|17|18|19|20|21|22| +-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | Text |a |b |c |x |a |b |c |d |a |b |x |a |b |c |d |a |b |c |d |a |b |c |y | +-------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ +---------+---+---+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---------+---+---+---+---+---+---+---+---+ | Pattern | a | b | c | d | a | b | c | y | +---------+---+---+---+---+---+---+---+---+

At first, our *Text* and *Pattern* matches till index **2**. **Text[3]** and **Pattern[3]** doesn’t match. So our aim is to not go backwards in this *Text*, that is, in case of a mismatch, we don’t want our matching to begin again from the position that we started matching with. To achieve that, we’ll look for a **suffix** in our *Pattern* right before our mismatch occurred (substring **abc**), which is also a **prefix** of the substring of our *Pattern*. For our example, since all the characters are unique, there is no suffix, that is the prefix of our matched substring. So what that means is, our next comparison will start from index **0**. Hold on for a bit, you’ll understand why we did this. Next, we compare **Text[3]** with **Pattern[0]** and it doesn’t match. After that, for *Text* from index **4** to index **9** and for *Pattern* from index **0** to index **5**, we find a match. We find a mismatch in **Text[10]** and **Pattern[6]**. So we take the substring from *Pattern* right before the point where mismatch occurs (substring **abcdabc**), we check for a suffix, that is also a prefix of this substring. We can see here **ab** is both the suffix and prefix of this substring. What that means is, since we’ve matched until **Text[10]**, the characters right before the mismatch is **ab**. What we can infer from it is that since **ab** is also a prefix of the substring we took, we don’t have to check **ab** again and the next check can start from **Text[10]** and **Pattern[2]**. We didn’t have to look back to the whole *Text*, we can start directly from where our mismatch occurred. Now we check **Text[10]** and **Pattern[2]**, since it’s a mismatch, and the substring before mismatch (**abc**) doesn’t contain a suffix which is also a prefix, we check **Text[10]** and **Pattern[0]**, they don’t match. After that for *Text* from index **11** to index **17** and for *Pattern* from index **0** to index **6**. We find a mismatch in **Text[18]** and **Pattern[7]**. So again we check the substring before mismatch (substring **abcdabc**) and find **abc** is both the suffix and the prefix. So since we matched till **Pattern[7]**, **abc** must be before **Text[18]**. That means, we don’t need to compare until **Text[17]** and our comparison will start from **Text[18]** and **Pattern[3]**. Thus we will find a match and we’ll return **15** which is our starting index of the match. This is how our KMP Substring Search works using suffix and prefix information.

Now, how do we efficiently compute if suffix is same as prefix and at what point to start the check if there is a mismatch of character between *Text* and *Pattern*. Let’s take a look at an example:

+---------+---+---+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---------+---+---+---+---+---+---+---+---+ | Pattern | a | b | c | d | a | b | c | a | +---------+---+---+---+---+---+---+---+---+

We’ll generate an array containing the required information. Let’s call the array **S**. The size of the array will be same as the length of the pattern. Since the first letter of the *Pattern* can’t be the suffix of any prefix, we’ll put **S[0]** = **0**. We take **i** = **1** and **j** = **0** at first. At each step we compare **Pattern[i]** and **Pattern[j]** and increment **i**. If there is a match we put **S[i]** = **j** + **1** and increment **j**, if there is a mismatch, we check the previous value position of **j** (if available) and set **j** = **S[j-1]** (if **j** is not equal to **0**), we keep doing this until **S[j]** doesn’t match with **S[i]** or **j** doesn’t become **0**. For the later one, we put **S[i]** = **0**. For our example:

j i +---------+---+---+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---------+---+---+---+---+---+---+---+---+ | Pattern | a | b | c | d | a | b | c | a | +---------+---+---+---+---+---+---+---+---+

**Pattern[j]** and **Pattern[i]** don’t match, so we increment **i** and since **j** is **0**, we don’t check the previous value and put **Pattern[i]** = **0**. If we keep incrementing **i**, for **i** = **4**, we’ll get a match, so we put **S[i]** = **S[4]** = **j** + **1** = **0** + **1** = **1** and increment **j** and **i**. Our array will look like:

j i +---------+---+---+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---------+---+---+---+---+---+---+---+---+ | Pattern | a | b | c | d | a | b | c | a | +---------+---+---+---+---+---+---+---+---+ | S | 0 | 0 | 0 | 0 | 1 | | | | +---------+---+---+---+---+---+---+---+---+

Since **Pattern[1]** and **Pattern[5]** is a match, we put **S[i]** = **S[5]** = **j** + **1** = **1** + **1** = **2**. If we continue, we’ll find a mismatch for **j** = **3** and **i** = **7**. Since **j** is not equal to **0**, we put **j** = **S[j-1]**. And we’ll compare the characters at **i** and **j** are same or not, since they are same, we’ll put **S[i]** = **j** + 1. Our completed array will look like:

+---------+---+---+---+---+---+---+---+---+ | S | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 1 | +---------+---+---+---+---+---+---+---+---+

This is our required array. Here a nonzero-value of **S[i]** means there is a **S[i]** length suffix same as the prefix in that substring (substring from **0** to **i**) and the next comparison will start from **S[i]** + **1** position of the *Pattern*. Our algorithm to generate the array would look like:

Procedure GenerateSuffixArray(Pattern): i := 1 j := 0 n := Pattern.length while i is less than n if Pattern[i] is equal to Pattern[j] S[i] := j + 1 j := j + 1 i := i + 1 else if j is not equal to 0 j := S[j-1] else S[i] := 0 i := i + 1 end if end if end while

The time complexity to build this array is `O(n)`

and the space complexity is also `O(n)`

. To make sure if you have completely understood the algorithm, try to generate an array for pattern `aabaabaa`

and check if the result matches with this one.

Now let’s do a substring search using the following example:

+---------+---+---+---+---+---+---+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 | +---------+---+---+---+---+---+---+---+---+---+---+---+---+ | Text | a | b | x | a | b | c | a | b | c | a | b | y | +---------+---+---+---+---+---+---+---+---+---+---+---+---+ +---------+---+---+---+---+---+---+ | Index | 0 | 1 | 2 | 3 | 4 | 5 | +---------+---+---+---+---+---+---+ | Pattern | a | b | c | a | b | y | +---------+---+---+---+---+---+---+ | S | 0 | 0 | 0 | 1 | 2 | 0 | +---------+---+---+---+---+---+---+

We have a *Text*, a *Pattern* and a pre-calculated array *S* using our logic defined before. We compare **Text[0]** and **Pattern[0]** and they are same. **Text[1]** and **Pattern[1]** are same. **Text[2]** and **Pattern[2]** are not same. We check the value at the position right before the mismatch. Since **S[1]** is **0**, there is no suffix that is same as the prefix in our substring and our comparison starts at position **S[1]**, which is **0**. So **Pattern[0]** is not same as **Text[2]**, so we move on. **Text[3]** is same as **Pattern[0]** and there is a match till **Text[8]** and **Pattern[5]**. We go one step back in the **S** array and find **2**. So this means there is a prefix of length **2** which is also the suffix of this substring (**abcab)** which is **ab**. That also means that there is an **ab** before **Text[8]**. So we can safely ignore **Pattern[0]** and **Pattern[1]** and start our next comparison from **Pattern[2]** and **Text[8]**. If we continue, we’ll find the *Pattern* in the *Text*. Our procedure will look like:

Procedure KMP(Text, Pattern) GenerateSuffixArray(Pattern) m := Text.Length n := Pattern.Length i := 0 j := 0 while i is less than m if Pattern[j] is equal to Text[i] j := j + 1 i := i + 1 if j is equal to n Return (j-i) else if i < m and Pattern[j] is not equal t Text[i] if j is not equal to 0 j = S[j-1] else i := i + 1 end if end if end while Return -1

The time complexity of this algorithm apart from the Suffix Array Calculation is `O(m)`

. Since *GenerateSuffixArray* takes `O(n)`

, the total time complexity of KMP Algorithm is: `O(m+n)`

.

PS: If you want to find multiple occurrences of *Pattern* in the *Text*, instead of returning the value, print it/store it and set `j := S[j-1]`

. Also keep a `flag`

to track whether you have found any occurrence or not and handle it accordingly.