All Articles

Regular Expression Matching

Image from Unsplash by Christian Chen
Image from Unsplash by Christian Chen

From LeetCode:

Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*'.

'.' Matches any single character.

'*' Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

Input: s = "aa", p = "a"
Output: False
Input: s = "aa", p = "a*"
Output: True

No collection of algorithmic problems is complete without an exploration of regular expressions. As developers, we frequently encounter RegEx patterns and the strict set of rules surrounding their use allows them to be conveniently framed as coding challenges.

We want to see if our RegEx pattern fully matches the given string, so a reasonable approach would be to consider subsets of the inputs and gradually work our way up (i.e. in a bottom-up fashion) to a comparison of both parameters in their complete form. We covered this idea of breaking down the dimensions of a problem in Maximum Profit from K Transactions.

For starters, let’s initialise our DP matrix:

def is_match(s, p):
    n, m = len(s), len(p)

    # DP[i][j] indicates if there's a match
    # between s[:i] and p[:j]
    DP = [
        [False for _ in range(n+1)]
        for _ in range(m+1)
    ]

Our DP matrix maps every relevant substring to every relevant “subpattern”, storing True if a match exists and False otherwise. Notice that i and j are exclusive indices this time –– we’re explicitly storing the solutions for the empty substrings (s[:0]) and empty subpatterns (p[:0]) instead of using a helper function to handle those edge cases.

Generally speaking, you should be flexible with how you reason about i and j in DP matrices and arrays. Don’t limit yourself to thinking about them as the index of last element –– they could represent the index of the element after the last (i.e. the size of your subset).

As an aside, we know that the wildcard character '.' matches any letter, so we’ll quickly define a helper for character-to-character comparisons:

def is_eq(c1, c2):
    if '.' in [c1, c2]:
        return True
    return c1 == c2

Base Cases

Before setting out to use smaller subproblems to solve larger ones, we’ll need to define the results of our base (i.e. empty) cases.

First, consider empty substrings versus empty subpatterns. Since both are empty, they are matching, so DP[0][0] must be True.

DP[0][0] = True

Now let’s think about the relationship between empty substrings and non-empty subpatterns. An empty string can be matched by ".*", ".*a*", or ".*a*b*" because '*', known as a quantifier, allows for zero instances of the letter preceding it.

As such, we can consider p[:j] a match for s[:0] if p[:j] is solely composed of character-quantifier pairs.

for j in range(2, len(m)+1):
    DP[0][j] = DP[0][j-2] if j-1 == '*' else False

What about the relationship between non-empty substrings and empty subpatterns? This will always be False, because an empty subpattern cannot match a non-empty string.

# Optional since we initialised DP[i][j] as False
for i in range(1, len(n)+1):
    DP[i][0] = False

Intermediate Cases

def is_match(s, p):
    # Init DP, handle bases cases

    for i in range(1, len(n)+1):
        for j in range(1, len(m)+1):
            # Make some magic happen ✨ 

With our base cases defined, we can dig into the meat and potatoes of our algorithm and use our answers to smaller subproblems to answer larger ones.

There are quite a number of cases to account for, but don’t worry –– we’ll go through them one-by-one.

Matching Tails

Let’s start with the simplest case. If the subpattern p[:j-1] matches the substring s[:i-1] and the character at p[j-1] matches the character at s[i-1], then it must be that p[:j] matches s[:i].

if DP[i-1][j-1] and is_eq(p[j-1], s[i-1]):
    DP[i][j] = True

# x matches x -> xy matches xy

Now we’ll consider more complex cases that involve the quantifier '*'.

Single Instance Quantifier

If the subpattern p[:j-1] matches the substring s[:i] and p[j-1] == '*', then it must be that the subpattern p[:j] also matches the substring s[:i]. The quantifier '*' can be used to denote a single instance of the character preceding it.

In other words, p[:j] should match whatever p[:j-1] is able to match.

if p[j-1] == "*":
    if DP[i][j-1]:
        DP[i][j] = True

# x matches x -> x* matches x

Multiple Instance Quantifier

If the subpattern p[:j] matches the substring s[:i-1] and the character preceding the quantifier p[j-2] matches s[i-1], then it must be that p[:j] also matches s[:i]. This is because a character-quantifier pair like "x*" can be used to run a match against letters multiple times (e.g. "xxxx").

if p[j-1] == "*" and j >= 2:
    if DP[i-1][j] and is_eq(p[j-2], p[i-1]):
        DP[i][j] = True


# x* matches x -> x* matches xx

Zero Instance Quantifier

If the subpattern p[:j-2] matches the substring s[:i], then it must be that p[:j] matches s[:i], since '*' can denote zero instances of the character preceding it.

In other words, p[:j] should match whatever p[:j-2] is able to match.

if p[j-1] == "*" and j >= 2:
    if DP[i][j-2]:
        DP[i][j] = True

# x matches x -> xy* matches x

Putting It All Together

Let’s put together all of the steps we’ve covered into a single function. Recall that we’ll start by initialising the state of our DP matrix to account for empty substrings / empty subpatterns before iterating through the subproblems with non-empty parameters.

# Helper function
def is_eq(c1, c2):         
    if '.' in [c1, c2]:
        return True
    return c1 == c2


def is_match(s, p):

    n, m = len(s), len(p)
    
    # DP[i][j] indicates if p[:j] matches s[:i]
    DP = [
        [False for _ in range(m+1)]
        for _ in range(n+1)
    ]
    
    # Empty string, empty pattern
    DP[0][0] = True

    # Empty string, non-empty pattern
    for j in range(2, m+1, 2):
        if DP[0][j-2] and p[j-1] == '*':
            DP[0][j] = True

    # Optional:
    # Empty pattern string vs empty pattern
    for i in range(1, n+1):
        DP[i][0] = False
    
    for i in range(1, n+1):
        for j in range(1, m+1):

            # Case 1: Matching tails
            # (x matches x -> xy matches xy)
            if DP[i-1][j-1] and is_eq(p[j-1], s[i-1]):
                DP[i][j] = True

            # Case 2: With '*' Quantifier
            if p[j-1] == '*':

                # Case 2A: Single instance quantifier
                # (x matches x -> x* matches x)
                if DP[i][j-1]:
                    DP[i][j] = True
                
                # Consider character-quantifier pairs
                if j >= 2:
                    
                    # Case 2B: Multiple instance quantifier
                    # (x* matches x -> x* matches xx)
                    if DP[i-1][j] and is_eq(p[j-2], s[i-1]):
                        DP[i][j] = True

                    # Case 2C: Zero instance quantifier
                    # (x matches x -> xy* matches x)
                    if DP[i][j-2]:
                        DP[i][j] = True

    return DP[-1][-1]

I’ve taken the liberty of being a bit more verbose with the code, but know that you can easily shorten some of these lines. For instance:

if DP[i][j-2]:
    DP[i][j] = True

can simply be written as

DP[i][j] |= DP[i][j-2]

For every DP[i][j], we’re able to determine its boolean value by referencing the subproblems preceding it (e.g. DP[i-1][j-1]) and evaluating the “marginal inputs” (e.g. s[i-1] and p[j-1]).

You can think of this algorithm as a process of going through a list of four cases / subproblems for every DP[i][j]. If we find that p[:j] matches s[:i] for a particular case, we’ll mark DP[i][j] as True.

Given a string of length n and a pattern of length m, our algorithm runs in O(nm) space and O(nm) time.

With that, we’ve solved the Regular Expression Matching problem 🆎🎰🔤.

Full Solution

def is_match(s, p):

    n, m = len(s), len(p)
    
    # Initialise DP array
    DP = [
        [False for _ in range(m+1)]
        for _ in range(n+1)
    ]

    # Empty substring vs empty pattern
    DP[0][0] = True

    # Empty substring vs non-empty subpatterns
    for j in range(2, m+1, 2):
        if p[j-1] == '*':
            DP[0][j] |= DP[0][j-2]

    # Non-empty substrings vs non-empty subpatterns
    for i in range(1, n+1):
        for j in range(1, m+1):

            # Case 1: Matching tails
            # (x matches x -> xy matches xy)
            if is_eq(p[j-1], s[i-1]):
                DP[i][j] |= DP[i-1][j-1]

            # Case 2: With '*' Quantifier
            if p[j-1] == '*':

                # Case 2A: Single instance quantifier
                # (x matches x -> x* matches x)
                DP[i][j] |= DP[i][j-1]
                
                # Consider character-quantifier pairs
                if j >= 2:
                    
                    # Case 2B: Multiple instance quantifier
                    # (x* matches x -> x* matches xx)
                    if is_eq(p[j-2], s[i-1]):
                        DP[i][j] |= DP[i-1][j]

                    # Case 2C: Zero instance quantifier
                    # (x matches x -> xy* matches x)
                    DP[i][j] |= DP[i][j-2]

    return DP[-1][-1]