|
| 1 | +# Z Algorithm |
| 2 | + |
| 3 | +This algorithm finds all occurrences of a pattern in a text in linear time. Let the length of the text be n and the length of the pattern be m; the total time taken is then O(m + n), with linear space complexity. Both the time and space complexity are the same as for the KMP algorithm, but this algorithm is arguably simpler to understand. |
| 4 | + |
| 5 | +In this algorithm, we construct a Z array. |
| 6 | + |
| 7 | +# What is Z array? |
| 8 | + |
| 9 | +For a string str[0..n-1], the Z array has the same length as the string. An element Z[i] stores the length of the longest substring starting at str[i] that is also a prefix of str[0..n-1]. The first entry, Z[0], is meaningless because the complete string is always a prefix of itself (it is shown as X in the example). |
| 10 | +Example: |
| 11 | +Index     0  1  2  3  4  5  6  7  8  9 10 11 |
| 12 | +Text      a  a  b  c  a  a  b  x  a  a  a  z |
| 13 | +Z values  X  1  0  0  3  1  0  0  2  2  1  0 |
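The definition can be checked directly with a brute-force sketch. This is not the linear-time construction, only a quadratic translation of the definition; the function name `z_naive` is made up for illustration, and Z[0] is stored as 0 where the table shows X.

```python
def z_naive(s: str) -> list[int]:
    """Compute Z values straight from the definition (quadratic time)."""
    n = len(s)
    z = [0] * n  # z[0] stays 0; the example table marks it as X (unused)
    for i in range(1, n):
        # Grow the match between the prefix s[0..] and the substring s[i..]
        # one character at a time until a mismatch or the end of the string.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


print(z_naive("aabcaabxaaaz"))
# [0, 1, 0, 0, 3, 1, 0, 0, 2, 2, 1, 0]
```

Apart from the placeholder at index 0, the printed values agree with the Z values row of the example.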
| 14 | + |
| 15 | +# How to construct Z array? |
| 16 | + |
| 17 | +The idea is to maintain an interval [L, R], the one with the largest R found so far, |
| 18 | +such that str[L..R] is a prefix substring (a substring which is also a prefix of str). |
| 19 | + |
| 20 | +Steps for maintaining this interval are as follows – |
| 21 | + |
| 22 | +1) If i > R then there is no prefix substring that starts before i and |
| 23 | + ends at or after i, so we reset L and R and compute a new [L,R] by comparing |
| 24 | + str[0..] to str[i..], which gives Z[i] (= R-L+1). |
| 25 | + |
| 26 | +2) If i <= R then let K = i-L. Now Z[i] >= min(Z[K], R-i+1) because |
| 27 | + str[i..] matches str[K..] for at least R-i+1 characters (they are in |
| 28 | + the [L,R] interval, which we know is a prefix substring). |
| 29 | + Two sub-cases arise – |
| 30 | + a) If Z[K] < R-i+1 then the prefix substring starting at str[i] |
| 31 | + cannot be longer than Z[K] (otherwise Z[K] would be larger), so |
| 32 | + Z[i] = Z[K] and the interval [L,R] remains the same. |
| 33 | + b) If Z[K] >= R-i+1 then it may be possible to extend the [L,R] interval, |
| 34 | + so we set L to i, continue matching from str[R+1] onwards to |
| 35 | + get the new R, then update the interval [L,R] and compute Z[i] (= R-L+1). |
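The steps above can be sketched in Python as follows. This is one possible reading of the description, not a reference implementation: the names `z_array`, `left` and `right` are illustrative, the interval [left, right] is inclusive as in the text, and Z[0] is simply left as 0.

```python
def z_array(s: str) -> list[int]:
    """Build the Z array using the [L, R] interval described above."""
    n = len(s)
    z = [0] * n
    left = right = 0  # inclusive interval [L, R]; nothing useful found yet
    for i in range(1, n):
        if i > right:
            # Case 1: i is outside the interval, so compare from scratch.
            left = right = i
            while right < n and s[right - left] == s[right]:
                right += 1
            z[i] = right - left
            right -= 1  # back to the last matched index (i - 1 if none)
        else:
            k = i - left
            if z[k] < right - i + 1:
                # Case 2a: the match ends inside [L, R]; reuse Z[K].
                z[i] = z[k]
            else:
                # Case 2b: the match reaches R; try to extend past it.
                left = i
                right += 1  # characters up to R are already known to match
                while right < n and s[right - left] == s[right]:
                    right += 1
                z[i] = right - left
                right -= 1
    return z


print(z_array("aabcaabxaaaz"))
# [0, 1, 0, 0, 3, 1, 0, 0, 2, 2, 1, 0]
```

For the example string, index 5 falls under case 2a (Z[1] = 1 is smaller than the 2 characters left in the interval), while index 9 falls under case 2b and extends R from 9 to 10.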
| 36 | + |
| 37 | +The algorithm runs in linear time because we never compare a character at an index less than R, and each successful match increases R by one, so there are at most n matching comparisons, where n is the length of the string. A mismatch happens at most once for each i (it is what stops R from advancing), which adds at most n more comparisons, so the overall complexity is linear. |
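The text does not spell out how the Z array is used to find a pattern, so the sketch below shows one common approach as an assumption: build the Z array of pattern + text and report every position in the text part where the Z value is at least m. The helper names `z_function` and `find_all` are hypothetical, and `z_function` is a compact half-open variant of the same interval idea.

```python
def z_function(s: str) -> list[int]:
    """Compact Z array; s[left:right] is the rightmost prefix match so far."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            # Reuse the value already known for the mirrored position.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def find_all(pattern: str, text: str) -> list[int]:
    """Return the start indices in text of every occurrence of pattern."""
    m = len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern reports no matches
    z = z_function(pattern + text)
    # Position i >= m in the combined string corresponds to text index i - m.
    return [i - m for i in range(m, len(z)) if z[i] >= m]


print(find_all("aab", "aabcaabxaaaz"))
# [0, 4]
```

The combined string has length m + n, so the search stays within the O(m + n) time and linear space stated at the top. Checking `z[i] >= m` only for i >= m avoids the need for a separator character between pattern and text.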