Parsing syntax using LPeg

LPeg is a pattern matching library for Lua. When I was first introduced to LPeg, I thought it was just another form of regular expression; I found out that it is much more powerful than regular expressions, and can easily match complex patterns that are difficult to match with regular expressions, and even parse syntax. In fact, LPeg is Parsing Expression Grammars for Lua, which is designed to parse grammars. LPeg makes it easy to parse all kinds of grammars, for example in 400 lines of code parsing Lua source code into abstract syntax trees. With it, static code analysis and customization of DSL (Domain Specific Language) will be a snap.

This article will not go into detail about every function and operator in LPeg, which is clear from the official documentation; it will focus on the matching mechanism of LPeg and how to use it. If you are new to PEG and LPeg, you can read this article first, and refer to the official documentation when you need to understand the detailed usage in practice. Since LPeg is a Lua implementation of PEG, let’s start with PEG.

PEG(Parsing Expression Grammars)

When it comes to pattern matching, regular expressions are the first thing that comes to mind for many students. For simple patterns, regular expressions are convenient; however, when the situation becomes complex, regular expressions are out of hand. Can you imagine using regular expressions to match conditional expressions? In addition, regular expressions have efficiency problems, and some strange regular expressions can lead to repeated backtracking, even to exponential time complexity. In order to match complex patterns, we need more powerful tools, and PEG is one of them.

Introduction

PEG was first proposed in a 2004 MIT paper Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. It is very much like CFG (Context-Free Grammars), except that CFG is a description of the language, while PEG is a parsing of the language, a distinction we will see later. The last section of the Lua documentation has a complete grammar described in BNF (Backus Naur Form), which is a representation of CFG. Here is a self-description of the syntax in PEG:

grammar     <-  (nonterminal ’<-’ sp pattern)+
pattern     <-  alternative (’/’ sp alternative)*
alternative <-  ([!&]? sp suffix)+
suffix      <-  primary ([*+?] sp)*
primary     <-  ’(’ sp pattern ’)’ sp / ’.’ sp / literal /
                charclass / nonterminal !’<-’
literal     <-  [’] (![’] .)* [’] sp
charclass   <-  ’[’ (!’]’ (. ’-’ . / .))* ’]’ sp
nonterminal <-  [a-zA-Z]+ sp
sp          <-  [ \t\n]*

As shown in line 1, the PEG syntax consists of more than one rule, each consisting of nonterminal and pattern separated by <-. The next rules indicate what the nonterminal and pattern consist of, in turn, down to the character level. Some of these rules are similar to regular expressions, for example + means the preceding pattern is repeated more than 1 time, * means it is repeated more than 0 times, and ? indicates 1 or 0 occurrences (see suffix in line 4); [] indicates a collection of characters (see charclass in line 8), etc. To eliminate ambiguity, literals need to be placed between '' (see literal, line 7). There are also ! means no match for the immediately following pattern (match succeeds if and only if the subsequent pattern fails), and & means match the immediately following pattern without consuming the input (see line 3, alternative).

As shown in line 2, each pattern can contain multiple alternatives, separated by a slash /, a bit like | in BNF. For example, line 5 represents a non-terminal primary pattern that can be wrapped in parentheses, or a wildcard character . , or a literal, or a character class, or a non-terminal character that is not immediately followed by <-. Unlike CFG, these options are sequential, and only the first option that fails to match will match the next option. Because PEG is used to describe a top-down parsing syntax, the ordered options allow for unambiguous parsing.

Limited backtracking

A major advantage of PEG is that it can restrict backtracking to a single matching rule. Once a selection is made, it is not changed by the failure of subsequent matches. For example, consider the following syntax:

1
2
3

S   <-  A B
A   <-  E1 / E2 / E3
B   <-  ...

When we try to make a string match S, we match A and then B. When we match A, since A has three options, we first try to match pattern E1, and if the match fails, we backtrack, then match E2, and so on. Once any of the options are matched, the rule is not backtracked. For example, if E2 is selected as the match for A, if B fails to match next, then the whole pattern fails, and B’s failure does not cause A to be reselected. This feature ensures that PEGs are efficient and do not have infinite backtracking like regular expressions.

Greedy Matching and Non-Greedy Matching

Students who know regular expressions should be familiar with greedy and non-greedy matching. For example, matching the string abcdXefghXijk , using the regular expression /. *X/ will match to the second X, which is a greedy match, and . * will match as many characters as possible; while using /. *?X/ will match to the first X, which is a non-greedy match, . *? will match as few characters as possible. This is convenient, but not elegant: for example, /. *?X/ , . * is meant to match any character above 0, and X is obviously included in the wildcard . ; just because it is followed by an X, /. *?X/ stops at the first X. /. *X/ is even stranger: it has to stop at the last X it encounters! This approach to regular expressions is intuitive, but logically odd.

The PEG approach is much simpler. PEGs always perform a greedy blind match, that is, they match as many characters as possible, regardless of other patterns before or after them. For example, the following PEG

`1`	`S <- .* ’X’`

It looks similar to the regular expression /. *X/ , but it doesn’t actually match any string. Because . * will match all characters until the end of the string; once it reaches the end of the string, there are no characters to match, so the match fails.

To achieve the regular expression /. *?X/, note that . will match any character, so to make it stop at the first X, we just need to change the match any character to match any character other than X. Use the following PEG:

`1`	`S <- (!’X’ .)* ’X’`

Due to the presence of !' X', the match will fail if X is encountered, so that ! X' . will match any character other than X. Such a PEG is longer than a regular expression, but the logic is clearer.

We can also use recursion to achieve the same effect: the following PEG will do:

`1`	`S <- ’X’ / . S`

The string is scanned sequentially when matching. For each character, it first tries to match 'X' , and if the match fails, it matches . S . This pattern will match any character, and then match the pattern S itself – that is, do the same for the next character. Until the character X is encountered, the match ends.

/. *X/ is a little more interesting. It requires the match to stop at the last X of the string. How do we know where the last X is without scanning the whole string? We need the following PEG:

`1`	`S <- . S / ’X’`

For each character, it will first try to match . S . where . will match any character, and then match the pattern S itself, i.e. do the same for the next character. This continues until the last character of the string:

1
2
3

abcdXefghXijk
            ^
            match `k` with `. S / ’X’`

When k matches . S, the wildcard character . will match the character k, and then the next character - the end of the string - will match S, which will naturally fail. So the optional . S fails, PEG will backtrack and try to match 'X', which will also fail. That means k matches . S / 'X' fails. Will this cause the whole string to fail? Of course not! Note that the whole operation is recursive, and the last character k matches the pattern . S / 'X' , and also the penultimate character j matches the pattern . S / 'X' in S . So the PEG will backtrack again, and so on, until the last character X .

Other mechanisms

PEGs do not require a regular expression like ^ or $ to indicate the beginning or end of a string. First of all, PEG will always match from the beginning of the string; for the end of the string, use the pattern ! , which does not match a single character - this condition is only met at the end of the string.

As mentioned above, & means matching the pattern immediately after but not consuming the input. For example, 'a' 'a' cannot match the string "a", because it requires two consecutive a's, but &‘a’ ‘a ’’ matches it, because when &'a' matches the character a, the input is not consumed and the pointer is not moved back, so the latter pattern 'a' can still match it.

In short, although PEG is less intuitive than regular expressions, its rules are simpler and closer to the essence of pattern matching.

LPeg

LPeg is a Lua implementation of PEG. LPeg does not implement the syntax of PEG, instead, it uses Lua features to implement a set of functions, objects, and construct patterns by overloading operators. Let’s look at some of the basic functions and operations of LPeg:

Operator	Description
`lpeg.P(string)`	Match the literal `string` . Equivalent to `''` in PEG
`lpeg.P(n)`	Match n arbitrary characters
`lpeg.S(string)`	Matches any character in `string`. Equivalent to `[]` in PEG
`lpeg.R("xy")`	Matches all characters in the range `x` to `y`. Equivalent to `[x-y]` in PEG
`patt ^ n`	Repeat `patt` at least `n` times
`patt ^ -n`	Repeat `patt` up to `n` times
`patt1 * patt2`	`patt1` is immediately followed by `patt2` . Equivalent to `patt1 patt2` in PEG
`patt1 + patt2`	Sequential selection. Match `patt1` or `patt2` . Equivalent to `patt1 / patt2` in PEG
`patt1 - patt2`	Only if `patt2` does not match, then `patt1` is matched. Equivalent to `!patt2 patt1` in PEG. can be interpreted as a difference set
`-patt`	Equivalent to `"" - patt` . does not match `patt` . Equivalent to `!patt` in PEG
`#patt`	Matches `patt` but does not consume input. Equivalent to `&patt` in PEG

As you can see, LPeg is much the same as PEG, but in a different form. The methods lpeg.P, lpeg.S, lpeg.R, etc. all return pattern objects, which override operators and can perform various operations with other patterns, the result of which is still pattern. For example, the aforementioned S <- (!' X' .) * 'X' can be written like this using LPeg:

1
2

local lpeg = require "lpeg"
local S = (lpeg.P(1) - "X") ^ 0 * "X"

Calling the match method of pattern matches a string, and returns the position of the end of the match:

`1`	`S:match("abcdXefghXijk") --> 6`

How to implement S <- 'X' / . S recursive pattern? lpeg.P also supports passing in a table. This table contains a series of keys k = v , the keys representing a non-terminator, and the values defining the pattern. Use lpeg.V to refer to other non-terminals. It is also required that the first value of the table table[1] be the initial symbol, since Lua’s table is unordered. For example, S <- 'X' / . S can be written like this with LPeg:

local lpeg = require "lpeg"
local P, V = lpeg.P, lpeg.V

local S = P{"S",
    S = P"X" + P(1) * V"S"
}

Here is a slightly more complex example:

local lpeg = require "lpeg"
local P, V = lpeg.P, lpeg.V

local S = P{"S",
    S = "a" * V"B" + "b" * V"A" + "";
    A = "a" * V"S" + "b" * V"A" * V"A";
    B = "b" * V"S" + "a" * V"B" * V"B";
}

This is equivalent to the following PEG:

1
2
3

S   <-  ’a’ B / ’b’ A / ’’
A   <-  ’a’ S / ’b’ A A
B   <-  ’b’ S / ’a’ B B

Capture

It is too boring to only match strings and return positions. LPeg is much more than that, LPeg has powerful capturing capabilities. Here is a list of LPeg capture methods:

Operation	What it Produces
`C(patt)`	matches the pattern `patt` and captures it
`Ct(patt)`	puts all the captures generated by `patt` in a table
`Cs(patt)`	Takes all the matches in `patt` and splices them into a single string
`Cc(values)`	matches the empty string and treats the given value `values` as a capture
`lpeg.Cp()`	matches the empty string and captures the current position
`Cf(patt, f)`	Pass all the catches generated by `patt` into the function `f` in sequence, similar to the reduce operation. If `patt` produces catches $C_1, C_2, … , C_n$ , then $f(…. . f(f(C_1, C_2), C_3)… , C_n)$ , then $f(… . f(f(C_1, C_2), C_3)… , C_n)$ , with the final return value of the function as the capture
`patt / string`	replaces the result of `patt` with the string `string`
`patt / number`	Take the nth capture of `patt`. If 0, no result is captured
`patt / table`	If the capture result of `patt` is `c`, then `table[c]` is the result of the capture
`patt / function`	Pass the result of `patt` into `function` and take its return value as the result of the capture

It is important to note that the capture result is only generated if the corresponding pattern matches successfully. For example, the pattern lpeg.C(lpeg.P "a" ^ -1) will return the null string when matching a string that does not start with a.

This would be done, for example, by splitting a string by a specified character:

function split(s, sep)
    sep = lpeg.P(sep)
    local elem = lpeg.C((1 - sep) ^ 0)
    local p = elem * (sep * elem) ^ 0
    return p:match(s)
end

where 1 - sep matches any character that is not a separator, then ^ 0 makes it repeat more than 0 times; then elem * (sep * elem) ^ 0 makes the split pattern repeat several times.

If you need to put the result in a table, just write it like this:

function split(s, sep)
    sep = lpeg.P(sep)
    local elem = lpeg.C((1 - sep) ^ 0)
    local p = lpeg.Ct(elem * (sep * elem) ^ 0)   -- make a table capture
    return p:match(s)
end

There are many similar examples on LPeg’s website , see for yourself.

Applications

LPeg is a very powerful tool, so you can manipulate strings as you wish. It can do a lot of interesting things.

Static analysis code

We can use LPeg to analyze syntax. For example, parsing SQL statements to intercept dangerous operations that are not allowed, such as update or delete without where, or certain user interfaces that only allow select statements to be executed. We can use LPeg to parse SQL statements to know exactly what a SQL statement will do, and even check for syntax errors. We can also parse create table statements to get the expected table structure, compare it with the database table structure, check for consistency, and even migrate data automatically.

This repo uses LPeg to parse Lua source code into abstract syntax trees. The entire parsed code is only about 400 lines long. We can use it for static analysis of the code, e.g. to analyze calls to a function, to statically evaluate certain expressions, etc.

Custom DSL

What if you’re out of syntax? Create your own! With LPeg, you can define your own syntax, your own language. For example, in game programming, planners often need to configure triggers. The conditions of triggers are sometimes complicated, such as “50% chance to trigger if blood is less than 10% or rage is more than 90”, which is difficult to describe in a specific format, and not safe enough to let the planner write the code directly. In this case, we can customize the DSL, and the configuration table can be assigned a string, for example, this condition can be represented as a string

`1`	`(hp < 10% \|\| wrath > 90) && random() > 0.5`

The string is then pre-compiled into a Lua function using LPeg, which is called at runtime.

Table of Contents