Regular matching using recursion

A long time ago I read Brain’s article on how to write elegant code. In it, the author tells the story of Rob Pike, who implemented a simple regular matching function in a little over an hour, with just over 30 lines of code plus comments, and whose implementation is not only simple and elegant, but also demonstrates the power of recursive functions and pointers. Today, I’d like to share my understanding of it with you.

Brain gives the following criteria for what constitutes elegant code:

Elegant code is usually concise, clear and easy to understand.
Elegant code is usually short, with just the right amount of code, no more and no less.
Elegant code should be general enough to solve a large class of similar problems.

The simple regular matching function implemented by Rob is a very good example of this.

Regular expressions are a special class of character matching syntax, invented by Stephen Kleene in the mid-1950s. By the mid-1960s, Ken Thompson, the father of UNIX, had added regular expression matching to his own implementation of the QED editor. Subsequent UNIX systems such as ed/vim/grep/sed/awk/prel have been influenced by this to support searching for text using regular expressions.

Brain and Rob also have a strong UNIX background. They co-authored The Practice of Programming (TPOP) in 1998 to give a simple implementation of regular matching, mainly for programming teaching purposes. However, the implementation was quite functional and the code was quite long. For example, the grep code was over 500 lines long and took about ten pages to print on paper.

So Bran suggested that Rob create a simplified version that wasn’t too long, but still demonstrated the core idea of regular expressions, and that the new solution had practical value and wasn’t just toy code. Rob then silently closes his office door, and an hour later Rob has implemented a simplified version of the regular matching program in about 30 lines of C code. He supports the following subset of regular syntax:

c matches any character literal, e.g. c matches character c, b matches character b, etc.
. matches any single character
^ matches the beginning of a character
$ matches the end of a character
* means that the character before it occurs zero or more times

Rob has carefully chosen a subset of these functions that not only cover most string search scenarios, but are also very easy to implement. So to write beautiful code, you need to get to the heart of the matter.

The core code is as follows, with comments totalling 34 lines:

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
  if (regexp[0] == '^')
    return matchhere(regexp+1, text);
  do {  /* must look even if string is empty */
    if (matchhere(regexp, text))
    return 1;
  } while (*text++ != '\0');
  return 0;
}

/* matchhere: search for regexp at beginning of text a.c */
int matchhere(char *regexp, char *text)
{
  if (regexp[0] == '\0')
    return 1;
  if (regexp[1] == '*')
    return matchstar(regexp[0], regexp+2, text);
  if (regexp[0] == '$' && regexp[1] == '\0')
    return *text == '\0';
  if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
    return matchhere(regexp+1, text+1);
  return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
  do {  /* a * matches zero or more instances */
    if (matchhere(regexp, text))
    return 1;
  } while (*text != '\0' && (*text++ == c || c == '.'));
  return 0;
}

The entry function is match, which accepts two char pointers, one for the regular expression and one for the content of the string to be searched. The c language uses the []char array to hold the string, with an extra \0 character at the end. In c, you can use the char * pointer to point to a specific location in []char.

The match logic is very simple, and is based on the presence or absence of ^ in the rule. If the regular expression is ^xyz, then xyz will only match if it appears at the beginning of text, and another function, matchhere, is called here; if there is no ^, then the match pattern may appear anywhere in the input string, so it is necessary to iterate through each substring of text to check for a match.

The loop condition in the match function is *text++ ! = '\0', it will keep scanning the substrings of the input string from left to right. If there is a match then 1 is returned, otherwise 0 is returned.

The core matching logic is implemented by matchhere. Here here indicates an in-place comparison, without regard to the start position of the substring. This is because it has already been dealt with in the match function in the previous layer.

matchhere is the typical recursive function and the highlight of this example at its core. The function processes a set of regular units at a time. Suppose the regular expression is ab*c.*$, then it processes them in the following order:

a
b*
c
.*
$

Note that matchhere does not need to handle the case with ^ because the triage has already been done in the match function.

matchhere only processes one set of regular units at a time. If the match fails, 0 is returned and the recursive process ends.

If all the regular units match successfully, then regexp must point to the end of the regular string, which is the \0 character, so it should return 1 and end the recursion. Here an if is used to achieve this.

1
2

if (regexp[0] == '\0')
    return 1;

If you don’t get to the end, there are three scenarios to deal with. Two special cases are dealt with first.

The first is when the regular contains a $ ending, which means that the corresponding pattern must appear at the end of the input. In other words, when the program scans to the last $ pattern, all the previous ones have been matched, and finally it just needs to check if there is any subsequent content in text. If there is, then the match is not at the end and 0 should be returned and the loop exited. So the corresponding code is as follows.

1
2

if (regexp[0] == '$' && regexp[1] == '\0')
  return *text == '\0';

Because the string in c must have the \0 character at the end. The previous judgement has already excluded the case where regexp[0] is \0, so here it is safe to compare regexp[0] == '$', followed by the union condition regexp[1] == '\0' to ensure that only the case where $ occurs at the end is handled. This time, if it occurs at the end of the input string, there will necessarily be *text == '\0'. However, the current recursion is ended whether or not it matches.

Another special case is the occurrence of * in the regular, which needs to be dealt with separately and we put it at the end.

1
2

if (regexp[1] == '*')
  return matchstar(regexp[0], regexp+2, text);

Then the third case is the ordinary case of recursive matching. Here you need to check that text has content and that the character is the same as the character in the rule or that the character in the rule is . so that any character can be matched. If the current character is matched, then the two pointers are moved forward one position each and the recursive matching logic is executed. In summary, regexp and text are both moving towards the end of their respective strings, eventually pointing to \0, so the recursion must end.

1
2

if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
  return matchhere(regexp+1, text+1);

A final word about the matchstar function. It is called with the following code:

1
2

if (regexp[1] == '*')
  return matchstar(regexp[0], regexp+2, text);

If you encounter * in a regular, you need to find the character regexp[0] before it, and then call matchstar separately to check for a match. If the match is successful, then you need to continue recursively from the regular after *. So the second argument here is regexp+2 to skip * and the character before it. Let’s look at the code for matchstar.

int matchstar(int c, char *regexp, char *text)
{
  do {  /* a * matches zero or more instances */
    if (matchhere(regexp, text))
    return 1;
  } while (*text != '\0' && (*text++ == c || c == '.'));
  return 0;
}

A do-while loop is used here, starting with a matchhere match, because * means that the preceding characters can be left out. If the match fails, the current text pointer is checked to see if it matches the current regular unit, and each match recursively calls matchhere to check if the subsequent pattern matches, moving the text pointer around. So once matchhere enters matchstar it doesn’t return until all the checks have been completed.

The above is all of Rob’s code. In Brain’s words the implementation is compact, elegant, efficient and useful. This is a great example of the appeal of recursion and pointers. I think so!

Of course, Rob’s implementation is obviously not perfect. For example, his implementation of * matching is a non-greedy match. Sometimes we need greedy matching, i.e. matching as long as possible, which could be changed to the following code.

int matchstar(int c, char *regexp, char *text)
{
  char *t;

  for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
    ;
  do {  /* * matches zero or more */
    if (matchhere(regexp, t))
      return 1;
  } while (t-- > text);
  return 0;
}

Here the for loop first moves the current pointer to the rightmost character matching the current regular cell, and then checks leftwards in turn for matching sub-patterns. The greedy matching algorithm is thus implemented.

If you just want to enjoy the elegant code of a master, then this is the end of the reading. But if you are a beginner, then the following exercises are recommended.

The first exercise is to implement the +, ? syntax. This is relatively simple and can be achieved with a little fine-tuning involving matchstar.

The second exercise is to implement the \. syntax, which is a bit more complicated, as Rob uses the normal char [] to hold the regular and treats . as a special character, which is a pain to deal with. If you want to implement a syntax like [abc] or [0-9], you need to use a new data structure. One possible data structure is given in the original article for the reader’s reference:

typedef struct RE {
  int     type;   /* CHAR, STAR, etc. */
  char    ch;     /* the character itself */
  char    *ccl;   /* for [...] instead */
  int     nccl;   /* true if class is negated [^...] */
} RE;

This example also inspired me in another way, that c pointers are really powerful and efficient. I have also seen Ben’s version of the Go language implementation. Because Go can’t do things like *text++, he has to use string to simulate it. And string needs to be copied every time it is passed, which creates extra memory consumption.

Finally, the main function code is attached for your testing purposes:

#include <stdio.h>
#include <string.h>

int match(char *regexp, char *text);
int matchhere(char *regexp, char *text);
int matchstar(int c, char *regexp, char *text);

int main(int argc, char *argv[]) {
    if (argc < 3) {
        fprintf(stderr, "Error: argument\n");
        return 1;
    }

    FILE *file;
    char buffer[1024*1024];
    file = fopen(argv[2], "r");
    if (file == NULL) {
        perror("Error opening file");
        return -1;
    }

    while (fgets(buffer, sizeof(buffer), file)) {
        buffer[strlen(buffer)-1] = '\0';
        if (match(argv[1], buffer)) {
            printf("%s\n", buffer);
        }
    }

    fclose(file);
    return 0;
}