Knuth-Morris-Pratt Algorithm
Definition:​
The Knuth-Morris-Pratt (KMP) Algorithm is a string matching algorithm developed by Donald Knuth, James H. Morris, and Vaughan Pratt. It improves upon naive string matching by utilizing information about previous character matches to avoid unnecessary comparisons, making it particularly efficient for patterns containing repeating subsequences.
Example with Explanation​
Consider a scenario where we have the following:
- Text:
ABABDABACDABABCABAB
- Pattern:
ABABCABAB
The goal of the KMP algorithm is to efficiently find all occurrences of the pattern in the text by using a precomputed "failure function."
1. Failure Function (Partial Match Table)​
For the given pattern, the KMP algorithm builds a failure function (also called the partial match table) to help track the longest prefix that is also a suffix up to each position in the pattern. The failure function for the pattern ABABCABAB
is constructed as follows:
-
Initialization:
-
We start by setting
Ï€[1] = 0
for the first character in the pattern, as it has no proper prefix or suffix. -
The table at this stage:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | | | | | | | | |
-
-
Step-by-Step Table Construction:
-
At i = 2 (Substring
AB
):-
The prefix
A
and suffixB
are different, so there is no prefix that is also a suffix. -
We set
Ï€[2] = 0
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | | | | | | | |
-
-
At i = 3 (Substring
ABA
):-
The longest prefix which is also a suffix is
A
(length 1). -
We set
Ï€[3] = 1
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | | | | | | |
-
-
At i = 4 (Substring
ABAB
):-
The longest prefix which is also a suffix is
AB
(length 2). -
We set
Ï€[4] = 2
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | 2 | | | | | |
-
-
At i = 5 (Substring
ABABC
):-
There is no prefix that is also a suffix here.
-
We set
Ï€[5] = 0
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | 2 | 0 | | | | |
-
-
At i = 6 (Substring
ABABCA
):-
The longest prefix-suffix match is
A
(length 1). -
We set
Ï€[6] = 1
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | 2 | 0 | 1 | | | |
-
-
At i = 7 (Substring
ABABCAB
):-
The longest prefix-suffix match is
AB
(length 2). -
We set
Ï€[7] = 2
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | 2 | 0 | 1 | 2 | | |
-
-
At i = 8 (Substring
ABABCABA
):-
The longest prefix-suffix match is
ABA
(length 3). -
We set
Ï€[8] = 3
. -
Updated table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | |
-
-
At i = 9 (Substring
ABABCABAB
):-
The longest prefix-suffix match is
ABAB
(length 4). -
We set
Ï€[9] = 4
. -
Final table:
| Sequence Position (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------------------------|---|---|---|---|---|---|---|---|---|
| Pattern (P[i]) | A | B | A | B | C | A | B | A | B |
| Failure Function (Ï€[i]) | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4 |
-
-
This failure function table helps optimize the KMP pattern matching process by indicating how much the pattern should shift on a mismatch, ensuring no redundant comparisons are made.
2. Matching Phase​
The KMP algorithm scans the text from left to right while comparing the pattern to each substring in the text. When a mismatch occurs, the algorithm uses the failure function to shift the pattern appropriately, ensuring that already-matched characters do not need to be rechecked.
Step-by-Step Matching for ABABCABAB
in ABABDABACDABABCABAB
:​
-
Initialize:
- Text (T) =
ABABDABACDABABCABAB
- Pattern (P) =
ABABCABAB
- Start at the first character of the text and pattern.
- Text (T) =
-
Match Characters:
- T[0] to T[4] (Text substring
ABABD
) matches P[0] to P[4]. A mismatch occurs at T[4] (D) and P[4] (C).
- T[0] to T[4] (Text substring
-
Use Failure Function:
- At mismatch, 4 characters have matched (
k = 4
). - Use the formula:
shift = k - π[k]
, where π[4] = 2. - Shift:
shift = 4 - 2 = 2
. Move the pattern 2 characters to the right.
- At mismatch, 4 characters have matched (
-
Continue Matching:
- Resume comparison with T[2] to T[7] and P[0] to P[5]. Another mismatch occurs at T[7] (C) and P[5] (A).
- With k = 3 matched characters, π[3] = 1.
- Shift:
shift = 3 - 1 = 2
. Move the pattern 2 characters right.
-
Repeat Process:
- Continue aligning and comparing until T[10] to T[18] perfectly matches P[0] to P[8].
- Pattern
ABABCABAB
is found at position 10 in the text.
-
Result:
- Pattern
ABABCABAB
occurs at index 10 in the text.
- Pattern
By using the failure function for shifts, KMP avoids redundant comparisons, resulting in a linear time complexity for the matching phase.
Characteristics:​
-
Failure Function:
- Computes partial match table (also called failure function)
- Tracks longest proper prefixes that are also suffixes
- Enables efficient backtracking
-
Linear Time Matching:
- Never backtracks in the main text
- Utilizes preprocessed information
- Maintains constant space complexity
-
Preprocessing Phase:
- Builds failure function array
- Analyzes pattern structure
- Enables optimal shifts during search
-
Left-to-Right Scanning:
- Scans both pattern and text left to right
- Uses failure function for mismatches
- Maintains linear time complexity
Time Complexity:​
-
Preprocessing:
- Where m is pattern length
- Constructs failure function
- One-time computation
-
Searching:
- Where n is text length
- Linear time guaranteed
- No backtracking in text
Space Complexity:​
- Space Usage:
- Failure function array
- No additional dynamic space
- Independent of text length
C++ Implementation:​
#include <iostream>
#include <vector>
#include <string>
using namespace std;
class KMPAlgorithm {
private:
// Compute the failure function
vector<int> computeLPSArray(const string& pattern) {
int m = pattern.length();
vector<int> lps(m, 0);
int len = 0; // length of previous longest prefix suffix
int i = 1;
while (i < m) {
if (pattern[i] == pattern[len]) {
len++;
lps[i] = len;
i++;
} else {
if (len != 0) {
len = lps[len - 1];
} else {
lps[i] = 0;
i++;
}
}
}
return lps;
}
public:
vector<int> search(const string& text, const string& pattern) {
vector<int> matches;
int n = text.length();
int m = pattern.length();
if (m == 0 || m > n) return matches;
// Preprocessing: compute failure function
vector<int> lps = computeLPSArray(pattern);
// Searching phase
int i = 0; // index for text
int j = 0; // index for pattern
while (i < n) {
if (pattern[j] == text[i]) {
i++;
j++;
}
if (j == m) {
// Pattern found
matches.push_back(i - j);
j = lps[j - 1];
} else if (i < n && pattern[j] != text[i]) {
if (j != 0) {
j = lps[j - 1];
} else {
i++;
}
}
}
return matches;
}
};
// Demonstration class
class KMPDemo {
public:
static void demonstrateSearch() {
KMPAlgorithm algo;
string text = "ABABDABACDABABCABAB";
string pattern = "ABABCABAB";
cout << "Text: " << text << endl;
cout << "Pattern: " << pattern << endl;
vector<int> matches = algo.search(text, pattern);
cout << "Pattern found at positions: ";
for (int pos : matches) {
cout << pos << " ";
}
cout << endl;
}
};
int main() {
KMPDemo::demonstrateSearch();
return 0;
}
Key Features:​
-
Failure Function:
- Efficient prefix-suffix matching
- Optimal shift calculation
- Pattern preprocessing
-
Linear Time Guarantee:
- No backtracking in text
- Optimal worst-case complexity
- Efficient pattern matching
-
Optimization Techniques:
- Preprocessed pattern analysis
- Efficient state transitions
- Minimal character comparisons
Applications:​
-
Text Processing:
- Text editors
- File searching
- Document analysis
-
Bioinformatics:
- DNA sequence matching
- Protein pattern analysis
- Genome sequencing
-
Network Security:
- Packet inspection
- Signature detection
- Data filtering
-
Information Retrieval:
- Pattern matching
- Text mining
- Content searching
Advanced Features:​
-
Algorithm Variants:
- Multiple pattern matching
- Circular string matching
- Approximate matching
-
Implementation Optimizations:
- Cache-friendly versions
- Parallel implementations
- Memory-efficient variants
Comparison with Other Algorithms:​
-
Advantages:
- Linear time guarantee
- No text backtracking
- Efficient for repetitive patterns
-
Trade-offs:
- Preprocessing overhead
- Additional space requirement
- Complex implementation
Performance Characteristics:​
-
Best Case:
- O(n) comparisons
- Linear scanning
- Immediate mismatches
-
Average Case:
- O(n) performance
- Consistent behavior
- Pattern-independent
-
Worst Case:
- O(n) guaranteed
- No performance degradation
- Stable behavior
Summary:​
The Knuth-Morris-Pratt Algorithm represents a significant advancement in string matching algorithms by introducing the concept of the failure function. This innovation allows the algorithm to avoid unnecessary comparisons by utilizing information about previous matches, resulting in a guaranteed linear-time performance.
The algorithm's efficiency comes from its ability to preprocess the pattern and build a failure function that enables optimal shifts during the matching phase. While it requires additional space for preprocessing, the KMP algorithm's guarantee of linear-time performance and its ability to handle repetitive patterns make it a valuable tool in various applications, from text processing to bioinformatics.
The implementation combines elegant theoretical concepts with practical efficiency, making it a fundamental algorithm in computer science. Its consistent performance characteristics and lack of text backtracking make it particularly suitable for real-time applications and streaming data processing.