Suchicodes | Pattern Matching

Pattern Matching

Given a pattern \(P\) and a text \(T\) find the first/all occurances of the pattern \(P\) inside of text \(T\).

Find the index of the first occurrence in a string (leetcode question 28).

Note: Every algorithm listed here only makes sense if you are not allowed to compute an index on the text. Therefore, this is not suitable for things like search engines.

All these algorithm require at least \(O(m + n)\) running time, if the text gets large enough, then we won’t be able to find the pattern in reasonable time. These algorithms are suitable when the text is dynamic in nature – like a text editor or on a singular web page (Ctrl-F on your browser).

For searching with precompute, refer to the following notes page: https://old.suchicodes.com/resources/blog/14222f2d#text-indexing-structures

Brute-force Method

Just compare substring with every possible position in the original string.

def strStr(haystack: str, needle: str) -> int:
    for i in range(len(haystack)):
        if haystack[i:i+len(needle)] == needle:
            return i
    return -1

This requires \(O(nm)\) running time as at each iteration, we have to iterate over the entire substring.

Knuth-Moris-Pratt (KMP) Algorithm

The Knuth-Moris-Pratt is an algorithm that preprocesses the pattern to build a longest prefix-suffix table (LPS table), then scan the text efficiently using that table.

Here is a good video explaining the algorithm: https://youtu.be/ynv7bbcSLKE.

The longest prefix suffix of a pattern is the longest prefix that matches a proper suffix of the substring ending at a given position. The KMP algorithm first builds a table of the longest prefix suffix at every index.

Example: «abab» has a longest prefix «ab» because it is both a prefix and a suffix but not the whole string itself.

example table for lps

Here are few examples explained visually:

example of lps calculation

We can compute this table using the following code:

def LPS(pattern):
    m = len(pattern)
    lps = [0] * m
    j = 0
    i = 1

    while i < m:
        if pattern[i] == pattern[j]:
            j += 1
            lps[i] = j
            i += 1
        else:
            if j != 0:
                j = lps[j - 1]
            else:
                lps[i] = 0
                i += 1

    return lps

lps[0] is always 0 since a single character has no proper prefix or suffix.
i traverses the pattern from left to right.
j tracks the length of the current matching prefix-suffix.

If the characters at positions i and j match, increment j as we are existing the current prefix-suffix match by 1 and assign it to lps[i].

def strStr(self, haystack: str, needle: str) -> int:
    i = 0
    j = 0
    m = len(needle)
    n = len(haystack)
    lps = LPS(needle)

    while i < n:
        if needle[j] == haystack[i]:
            i += 1
            j += 1
            if j == m:
                return i - m
        else:
            if j != 0:
                j = lps[j-1]
            else:
                i += 1

    return -1

Boyer-Moore Algorithm

Boyer-Moore uses bad character and good suffix heuristics to skip sections of the text. It runs in \(O(n + m)\) but can degrade to \(O(nm)\) without good heuristics. However, in the best case it runs in sublinear time and average performance is excellent – we end up skipping processing a lot of the characters!

Ben Langmead has a great video explaining Boyer-Moore: https://www.youtube.com/watch?v=4Xyhb72LCX4.

In Boyer-Moore, we compare the text and the pattern right to left. We do this so that when there is a mismatch, we can skip ahead a lot more characters. The algorithm is based off two rules:

Bad Character Rule

When scanning from right to left, if we find a mismatch in the character, we have one of two case: If the character doesn’t exist in the pattern, we can skip so that the whole pattern is now after the character. If the character does exist, we can move the pattern until the mismatch becomes a match.

Boyer-Moore bad character rule example

Good Suffix Rule

If \(t\) is the substring that’s matched so far in the pattern, then we skip until there are no mismatches between \(P\) and \(t\) or \(P\) moves past \(t\). This can happen by either matching the suffix matches with an internal substring or a prefix of \(P\) that matches the suffix of \(t\).

Boyer-Moore good suffix rule example

Overall Algorithm

We align the pattern starting from the left going to the right and at each alignment, we compare starting from the right going to the left. Everytime we have a mismatch, we just check both rules and we skip whichever gives us the higher number.

Boyer-Moore overall algorithm

Rabin-Karp Algorithm

Rabin-Karp just uses a rolling hash function to first hash the pattern and then iterate over the text, comparing hashes.

Here’s a good video about it: https://youtu.be/yFHV7weZ_as?.

def rabin_karp(text: str, pattern: str) -> int:
    base = 256
    prime = 1000000007
    n = len(text)
    m = len(pattern)
    bpm = b ** (m-1)

    hash_pat = 0
    hash_txt = 0

    # compute initial hash values at the start
    for i in range(m):
        hash_pat = (base * hash_pat + ord(pattern[i])) % prime
        hash_txt = (base * hash_txt + ord(text[i])) % prime

    # slide pattern over text one-by-one
    for i in range(n - m + 1):
        if hash_pat == hash_txt:
            # if hash match, check if pattern match
            if text[i:i+m] == pattern:
                return i

        # roll hash by one character
        hash_txt = (hash_txt - ord(text[i]) * bpm) * base + ord(text[i+m])
        hash_txt = hash_txt % prime
    
    return -1

Which Algorithm Should I Use?

This is a table that was generated by ChatGPT:

Scenario	Characteristics	Recommended Algorithm	Why
Small text, small pattern	`n, m ≤ 100`	Naive Search	Minimal overhead; fastest in practice
Medium text, medium pattern	`n, m ≤ 10⁴`	KMP	Safe, linear, low memory, no hash issues
Large text, short pattern	e.g. text with millions of chars, pattern ≤ 100	Boyer-Moore-Horspool (BMH)	Very fast in practice due to large skips
Large text, repetitive pattern	pattern like `"aaaaaa"`	KMP	Handles repeated characters without slowdowns
Very large text, many searches	same pattern reused	KMP with precomputed LPS	Build once, reuse
Search for multiple patterns	e.g. blacklist scan	Aho-Corasick	Trie + BFS — finds all patterns in one pass
Binary/text with small alphabet	DNA (A,C,G,T), binaries	KMP or Suffix Automaton	BMH performs poorly on low entropy
Text or pattern is hashed	e.g. plagiarism detection	Rabin-Karp	Fast rolling hash, cheap equality checks
Find all occurrences	not just first match	KMP or Rabin-Karp (with hash verify)	Built for this; BMH doesn’t help
Memory-constrained	low RAM, embedded device	Naive or BMH	Low footprint
Worst-case resistance	avoid pathological slowdowns	KMP or Boyer-Moore Full	Guaranteed linear time
Unicode / multibyte input	e.g. UTF-8 or wide strings	KMP, Naive, or locale-aware tools	Hash-based approaches can break on encoding boundaries

More Algorithms To Explore

Boyer-Moore-Horspool (BMH): only bad-character rule from Boyer-Moore, runs with less overhead.
Aho-Corasick: find all search patterns in text in a single pass using Tries + BFS.

Settings

Pattern Matching

Table of Contents

Pattern Matching

Brute-force Method

Knuth-Moris-Pratt (KMP) Algorithm

Boyer-Moore Algorithm

Bad Character Rule

Good Suffix Rule

Overall Algorithm

Rabin-Karp Algorithm

Which Algorithm Should I Use?

More Algorithms To Explore