All Articles

Alien Dictionary

Image from Unsplash by Robynne Hu
Image from Unsplash by Robynne Hu

28 Coding Challenges in 28 Days

Writing about code has plenty of perks, but most of all, it’s been fun.

Over the course of February, I’ll be reviewing 28 algorithmic problems. Specifically, I’ll be walking through one problem a day. On the last day of the month (the 29th) I’ll compile these problems into a “megapost” of sorts so that it’ll be easier to make sense of the content.

Without further ado, let’s dive into our very first problem of the day.

The Alien Dictionary Problem

From LeetCode:

There is a new alien language which uses the latin alphabet. However, the order among letters is unknown to you. You receive a list of non-empty words from the dictionary, where words are sorted lexicographically by the rules of this new language. Derive the order of letters in this language.

Input: ["wrt", "wrf", "er", "ett", "rftt"]
Output: "wertf"
Input: ["z","x","z"]
Output: "" # No valid ordering

Establishing Relationships Between Letters

The first thing to recognise is that the relationships between letters can be represented by a directed graph. This is common pattern you’ll find in replicated in multiple algorithmic problems, so you’ll want to learn to apply it well!

Let’s define the structure of the nodes in our graph:

class Node:
    def __init__(self, label):
        self.label = label
        # u -> v indicates that u is sorted AFTER v
        self.to = set()

In order to obtain the complete set of nodes, we’ll traverse the list of words to find all unique letters:

def alien_dictionary(words):
    n = len(words)

    # Init nodes
    nodes = {}
    for word in words:
        for c in word:
            if c not in nodes:
                nodes[c] = Node(c)

    # ...

With these nodes initialised, we’ll zip() through pair of words to find letter-to-letter sorting relationships. These relationships will be represented by the edges of our graph.

Note that, for a given pair of strings like ["foobar", "fosqux"], we only care to compare the substrings foo and fos, which tell us that o precedes s. The remaining substrings are no longer relevant because the strings have diverged (i.e. bar and qux can no longer be compared to each other).

# Build edges in graph
for i in range(n-1):
    for c1, c2 in zip(words[i], words[i+1]):
        if c1 == c2:
            continue
        elif c1 != c2:
            nodes[c2].to.add(c1)
            break

An edge going from letter a to letter b tells us that a’s ordering in the alien dictionary comes after b‘s. How you define the semantics behind your edges is, of course, up to you –– but I believe this is easier to reason about.

Sorting Letters Topologically

At this point, we’ve managed to represent the relative orders between pairs of letters (e.g. a vs b, c vs d) using edges. Our goal, however, is to coalesce all of these pairwise relationships into a single, linear ordering (e.g. abcd).

In other words, we want to run a topological sorting algorithm that will help us achieve a topological ordering on all of the letters.

Sorting letters topologically primarily involves running one or more depth-first traversals on our graph. The first item to be included in this ordering will be a terminal node (i.e. a node that only has edges pointing to it. When the DFS algorithm backtracks, this terminal node’s ancestors will be appended to the ordering.

"""
a --> b --> c
      |
      V
      d
"""
# Running a DFS from the node 'a' gives us
# "cdba" or "dcba", depending on whether
# we visit 'c' or 'd' first from 'b'

Note that it is possible for cyclic relationships to exist between letters. A cycle in our graph just tells us that the word list provided to our algorithm isn’t a valid one. In this case, there’s no possible way to topologically sort the letters, so we’ll just return an empty string (i.e. "").

Let’s express the topological sorting algorithm in code:

# Run topo sort
visiting = set()
res = []

def topo_sort(label):
    visiting.add(label)

    node = nodes[label]
    for v_label in node.to:

        # If we encounter a letter we've seen before,
        # then there's no way to derive a
        # topological ordering
        if v_label in visiting:
            return False

        if v_label in nodes:
            # Propogate detected cycle
            if topo_sort(v_label) is False:
                return False

    res.append(label)

    # Delete node so we don't visit it again
    del nodes[label] 

    visiting.remove(label)

    # Run topo sort on all nodes
    while nodes:
        label = next(iter(nodes))
        if topo_sort(label) is False:
            return ""

    return ''.join(res)

We check for repeated sightings of the same letter by maintaining a visiting set. When a cycle is detected, we propagate that information through the call stack by returning a boolean.

Once we’re done exploring all of the nodes in our graph, we’ll be able to return the sequence we built (i.e. res) as our answer.

Given n number of words and k letters in each word, our algorithm runs in O(nk) space and O(nk) time.

With that, we’ve solved the Alien Dictionary problem 👾📕.

Full Solution

def alien_dictionary(words):
    n = len(words)

    # Init nodes
    nodes = {}
    for word in words:
        for c in word:
            if c not in nodes:
                nodes[c] = Node(c)

    # Build graph
    for i in range(n-1):
        for c1, c2 in zip(words[i], words[i+1]):
            if c1 == c2:
                continue
            elif c1 != c2:
                nodes[c2].to.add(c1)
                break

    for label in nodes:
        print(label, nodes[label].to)

    # Run topo sort
    visiting = set()
    res = []

    def topo_sort(label):
        visiting.add(label)

        node = nodes[label]
        for v_label in node.to:
            if v_label in visiting:
                return False

            if v_label in nodes:
                if topo_sort(v_label) is False:
                    return False

        res.append(label)
        del nodes[label]

        visiting.remove(label)

    while nodes:
        label = next(iter(nodes))
        if topo_sort(label) is False:
            return ""

    return ''.join(res)